Morning everyone,
I hope people are waking up by now ;-)
So, what is the take on "multi-path IO" (in particular, storage) in 2.5/2.6?
Right now, we have md multipathing in 2.4 (+ an enhancement to that one by
Jens Axboe and myself, which however was ignored on l-k ;-), an enhancement to
LVM1 and various hardware-specific and thus obviously wrong approaches.
I am looking at what to do for 2.5. I have considered porting the small
changes from 2.4 to md 2.5. The LVM1 changes are probably out and gone, as
LVM1 still doesn't work.
I noticed that EVMS duplicates the entire md layer internally (great way to
code, really!), so that might also require changing if I update the md code.
Or can the LVM2 device-mapper be used to do that more cleanly?
I wonder whether anyone has given this some thought already.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
On Mon, 2002-09-09 at 11:49, Lars Marowsky-Bree wrote:
> Or can the LVM2 device-mapper be used to do that more cleanly?
>
> I wonder whether anyone has given this some thought already.
The md layer code can already do the job fine - but it does need to get
to the point that the block layer provides better error information
upstream so it can make better decisions.
LVM2 is a nice clean remapper, so it should sit on top of the md or
other failover mappers easily enough. You can probably do failover by
updating map tables too.
It's nice clean code, unlike EVMS, and doesn't duplicate half the kernel,
so it's easier to hack on.
Lars Marowsky-Bree <[email protected]> said:
> So, what is the take on "multi-path IO" (in particular, storage) in
> 2.5/2.6?
I've already made my views on this fairly clear (at least from the SCSI stack
standpoint):
- multi-path inside the low level drivers (like qla2x00) is wrong
- multi-path inside the SCSI mid-layer is probably wrong
- from the generic block layer on up, I hold no specific preferences
That being said, I'm not particularly happy to have the multi-path solution
tied to a specific raid driver; I'd much rather it were in something generic
that could be made use of by all raid drivers (and yes, I do see the LVM2
device mapper as more hopeful than md in this regard).
> I am looking at what to do for 2.5. I have considered porting the
> small changes from 2.4 to md 2.5. The LVM1 changes are probably out
> and gone, as LVM1 still doesn't work.
Well, neither of the people most involved in the development (that's Neil
Brown for md in general and Ingo Molnar for the multi-path enhancements) made
any comments---see if you can elicit some feedback from either of them.
James
On Mon, Sep 09, 2002 at 09:57:56AM -0500, James Bottomley wrote:
> Lars Marowsky-Bree <[email protected]> said:
> > So, what is the take on "multi-path IO" (in particular, storage) in
> > 2.5/2.6?
>
> I've already made my views on this fairly clear (at least from the SCSI stack
> standpoint):
>
> - multi-path inside the low level drivers (like qla2x00) is wrong
Agreed.
> - multi-path inside the SCSI mid-layer is probably wrong
Disagree
> - from the generic block layer on up, I hold no specific preferences
Using md or volume manager is wrong for non-failover usage, and somewhat
bad for failover models; generic block layer is OK but it is wasted code
for any lower layers that do not or cannot have multi-path IO (such as
IDE).
>
> James
I have a newer version of SCSI multi-path in the mid layer that I hope to
post this week, the last version patched against 2.5.14 is here:
http://www-124.ibm.com/storageio/multipath/scsi-multipath
Some documentation is located here:
http://www-124.ibm.com/storageio/multipath/scsi-multipath/docs/index.html
I have a current patch against 2.5.33 that includes NUMA support (it
requires discontigmem support that I believe is in the current linus
bk tree, plus NUMA topology patches).
A major problem with multi-path in md or other volume manager is that we
use multiple (block layer) queues for a single device, when we should be
using a single queue. If we want to use all paths to a device (i.e. round
robin across paths or such, not a failover model) this means the elevator
code becomes inefficient, maybe even counterproductive. For disk arrays,
this might not be bad, but for actual drives or even plugging single
ported drives into a switch or bus with multiple initiators, this could
lead to slower disk performance.
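As a toy illustration of the elevator point (made-up userspace C, nothing to do
with the real block layer code): deal a purely sequential stream of requests
round-robin across per-path queues and the back-to-back merge opportunities
mostly disappear.

#include <stdio.h>

#define NREQ    32          /* sequential requests, purely illustrative */
#define REQLEN   8          /* sectors per request                      */

/* Deal NREQ sequential requests round-robin across 'npaths' queues and
 * count how many adjacent pairs within a queue remain contiguous
 * (i.e. candidates for an elevator merge). Assumes npaths <= 16. */
static int contiguous_pairs(int npaths)
{
        int last_end[16] = { 0 };   /* per-queue end sector */
        int merges = 0, i, q;

        for (i = 0; i < NREQ; i++) {
                int start = i * REQLEN;

                q = i % npaths;
                if (i >= npaths && last_end[q] == start)
                        merges++;
                last_end[q] = start + REQLEN;
        }
        return merges;
}

int main(void)
{
        printf("1 queue:  %d mergeable pairs\n", contiguous_pairs(1));
        printf("4 queues: %d mergeable pairs\n", contiguous_pairs(4));
        return 0;
}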
If the volume manager implements only a failover model (use only a single
path until that path fails), besides performance issues in balancing IO
loads, we waste space allocating an extra Scsi_Device for each path.
In the current code, each path is allocated a Scsi_Device, including a
request_queue_t, and a set of Scsi_Cmnd structures. Not only do we
end up with a Scsi_Device for each path, we also have an upper level
(sd, sg, st, or sr) driver attached to each Scsi_Device.
For sd, this means if you have n paths to each SCSI device, you are
limited to whatever limit sd has divided by n, right now 128 / n. Having
four paths to a device is very reasonable, limiting us to 32 devices, but
with the overhead of 128 devices.
Using a volume manager to implement multiple paths (again non-failover
model) means that the queue_depth might be too large if the queue_depth
(i.e. number of outstanding commands sent to the drive) is set as a
per-device value - we can end up sending n * queue_depth commands to a device.
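To put rough numbers on that, a trivial sketch (the depth and path count below
are invented, and dividing the depth across paths is only one possible
mitigation):

#include <stdio.h>

/* Invented numbers, for illustration only. */
#define DEVICE_QUEUE_DEPTH 32   /* commands the device itself can queue     */
#define NUM_PATHS           4   /* paths the volume manager sees as devices */

int main(void)
{
        /* Each path carries its own queue_depth, so the worst case is: */
        int worst_case = NUM_PATHS * DEVICE_QUEUE_DEPTH;

        /* One possible mitigation: split the device's depth across paths. */
        int per_path = DEVICE_QUEUE_DEPTH / NUM_PATHS;

        printf("worst case outstanding commands: %d\n", worst_case);
        printf("clamped per-path queue depth:    %d\n", per_path);
        return 0;
}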
multi-path in the scsi layer enables multi-path use for all upper level scsi
drivers, not just disks.
We could implement multi-path IO in the block layer, but if the only user
is SCSI, this gains nothing compared to putting multi-path in the scsi
layers. Creating block level interfaces that will work for future devices
and/or future code is hard without already having the devices or code in
place. Any block level interface still requires support in the
underlying layers.
I'm not against a block level interface, but I don't have ideas or code
for such an implementation.
Generic device naming consistency is a problem if multiple devices show up
with the same id.
With the scsi layer multi-path, ide-scsi or usb-scsi could also do
multi-path IO.
-- Patrick Mansfield
[email protected] said:
> Using md or volume manager is wrong for non-failover usage, and
> somewhat bad for failover models; generic block layer is OK but it is
> wasted code for any lower layers that do not or cannot have multi-path
> IO (such as IDE).
What about block devices that could usefully use multi-path to achieve network
redundancy, like nbd? If it's in the block layer or above, they can be made to
work with minimal effort.
My basic point is that the utility of the feature transcends SCSI, so SCSI is
too low a layer for it.
I wouldn't be too sure even of the IDE case: IDE has a habit of copying SCSI
features when they become more main-stream (and thus cheaper). It wouldn't
surprise me to see multi-path as an adjunct to the IDE serial stuff.
> A major problem with multi-path in md or other volume manager is that
> we use multiple (block layer) queues for a single device, when we
> should be using a single queue. If we want to use all paths to a
> device (i.e. round robin across paths or such, not a failover model)
> this means the elevator code becomes inefficient, maybe even
> counterproductive. For disk arrays, this might not be bad, but for
> actual drives or even plugging single ported drives into a switch or
> bus with multiple initiators, this could lead to slower disk
> performance.
That's true today, but may not be true in 2.6. Suparna's bio splitting code
is aimed precisely at this and other software RAID cases.
> In the current code, each path is allocated a Scsi_Device, including a
> request_queue_t, and a set of Scsi_Cmnd structures. Not only do we end
> up with a Scsi_Device for each path, we also have an upper level (sd,
> sg, st, or sr) driver attached to each Scsi_Device.
You can't really get away from this. Transfer parameters are negotiated at
the Scsi_Device level (i.e. per device path from HBA to controller), and LLDs
accept I/O's for Scsi_Devices. Whatever you do, you still need an entity that
performs most of the same functions as the Scsi_Device, so you might as well
keep Scsi_Device itself, since it works.
> For sd, this means if you have n paths to each SCSI device, you are
> limited to whatever limit sd has divided by n, right now 128 / n.
> Having four paths to a device is very reasonable, limiting us to 32
> devices, but with the overhead of 128 devices.
I really don't expect this to be true in 2.6.
> Using a volume manager to implement multiple paths (again non-failover
> model) means that the queue_depth might be too large if the
> queue_depth (i.e. number of outstanding commands sent to the drive)
> is set as a per-device value - we can end up sending n * queue_depth
> commands to a device.
The queues tend to be in the controllers, not in the RAID devices, thus for a
dual path RAID device you usually have two caching controllers and thus twice
the queue depth (I know this isn't always the case, but it certainly is enough
of the time for me to argue that you should have the flexibility to queue per
path).
> We could implement multi-path IO in the block layer, but if the only
> user is SCSI, this gains nothing compared to putting multi-path in the
> scsi layers. Creating block level interfaces that will work for future
> devices and/or future code is hard without already having the devices
> or code in place. Any block level interface still requires support in
> the underlying layers.
> I'm not against a block level interface, but I don't have ideas or
> code for such an implementation.
SCSI got into a lot of trouble by going down the "kernel doesn't have X
feature I need, so I'll just code it into the SCSI mid-layer instead" road; I'm
loth to accept something into SCSI that I don't think belongs there in the
long term.
Answer me this question:
- In the foreseeable future does multi-path have uses other than SCSI?
I've got to say, I can't see a "no" to that one, so it fails the high level
bar to getting into the scsi subsystem. However, the kernel, as has been said
before, isn't a theoretical exercise in design, so is there a good expediency
argument (like "it will take one year to get all the features of the block
layer to arrive and I have a customer now"). Also, to go in under expediency,
the code must be readily removable against the day it can be redone correctly.
> Generic device naming consistency is a problem if multiple devices
> show up with the same id.
Patrick Mochel has an open task to come up with a solution to this.
> With the scsi layer multi-path, ide-scsi or usb-scsi could also do
> multi-path IO.
The "scsi is everything" approach got its wings shot off at the kernel summit,
and subsequently confirmed its death in a protracted wrangle on lkml (I can't
remember the reference off the top of my head, but I'm sure others can).
James
James Bottomley wrote:
>Answer me this question:
>- In the foreseeable future does multi-path have uses other than SCSI?
The S/390 DASD driver could conceivably make use of generic block layer
(or higher-up) multi-path support.
(We have multi-path support on a lower level, the channel subsystem,
but this helps only with reliability / failover. Using multi-path
support on a higher level for performance reasons would be helpful
in certain scenarios.)
Mit freundlichen Gruessen / Best Regards
Ulrich Weigand
--
Dr. Ulrich Weigand
Linux for S/390 Design & Development
IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
Phone: +49-7031/16-3727 --- Email: [email protected]
James Bottomley [[email protected]] wrote:
> [email protected] said:
> > Using md or volume manager is wrong for non-failover usage, and
> > somewhat bad for failover models; generic block layer is OK but it is
> > wasted code for any lower layers that do not or cannot have multi-path
> > IO (such as IDE).
>
> What about block devices that could usefully use multi-path to achieve network
> redundancy, like nbd? If it's in the block layer or above, they can be made to
> work with minimal effort.
When you get into networking I believe we may get into path failover
capability that is already implemented by the network stack. So the
paths may not be visible to the block layer.
>
> My basic point is that the utility of the feature transcends SCSI, so SCSI is
> too low a layer for it.
>
> I wouldn't be too sure even of the IDE case: IDE has a habit of copying SCSI
> features when they become more main-stream (and thus cheaper). It wouldn't
> surprise me to see multi-path as an adjunct to the IDE serial stuff.
>
The utility does transcend SCSI, but transport / device specific
characteristics may make "true" generic implementations difficult.
To add functionality beyond failover multi-path you will need to get into
transport and device specific data gathering.
> > A major problem with multi-path in md or other volume manager is that
> > we use multiple (block layer) queues for a single device, when we
> > should be using a single queue. If we want to use all paths to a
> > device (i.e. round robin across paths or such, not a failover model)
> > this means the elevator code becomes inefficient, maybe even
> > counterproductive. For disk arrays, this might not be bad, but for
> > actual drives or even plugging single ported drives into a switch or
> > bus with multiple initiators, this could lead to slower disk
> > performance.
>
> That's true today, but may not be true in 2.6. Suparna's bio splitting code
> is aimed precisely at this and other software RAID cases.
I have not looked at Suparna's patch but it would seem that device
knowledge would be helpful for knowing when to split.
> > In the current code, each path is allocated a Scsi_Device, including a
> > request_queue_t, and a set of Scsi_Cmnd structures. Not only do we end
> > up with a Scsi_Device for each path, we also have an upper level (sd,
> > sg, st, or sr) driver attached to each Scsi_Device.
>
> You can't really get away from this. Transfer parameters are negotiated at
> the Scsi_Device level (i.e. per device path from HBA to controller), and LLDs
> accept I/O's for Scsi_Devices. Whatever you do, you still need an entity that
> performs most of the same functions as the Scsi_Device, so you might as well
> keep Scsi_Device itself, since it works.
James, have you looked at the documentation / patch previously pointed to
by Patrick? There is still a Scsi_Device.
>
> > For sd, this means if you have n paths to each SCSI device, you are
> > limited to whatever limit sd has divided by n, right now 128 / n.
> > Having four paths to a device is very reasonable, limiting us to 32
> > devices, but with the overhead of 128 devices.
>
> I really don't expect this to be true in 2.6.
>
While the device space may be increased in 2.6 you are still consuming
extra resources, but we do this in other places also.
> > We could implement multi-path IO in the block layer, but if the only
> > user is SCSI, this gains nothing compared to putting multi-path in the
> > scsi layers. Creating block level interfaces that will work for future
> > devices and/or future code is hard without already having the devices
> > or code in place. Any block level interface still requires support in
> the underlying layers.
>
> > I'm not against a block level interface, but I don't have ideas or
> > code for such an implementation.
>
> SCSI got into a lot of trouble by going down the "kernel doesn't have X
> feature I need, so I'll just code it into the SCSI mid-layer instead" road; I'm
> loth to accept something into SCSI that I don't think belongs there in the
> long term.
>
> Answer me this question:
>
> - In the foreseeable future does multi-path have uses other than SCSI?
>
See top comment.
> The "scsi is everything" approach got its wings shot off at the kernel summit,
> and subsequently confirmed its death in a protracted wrangle on lkml (I can't
> remember the reference off the top of my head, but I'm sure others can).
Could you point this out so I can understand the context?
-Mike
--
Michael Anderson
[email protected]
James -
On Mon, Sep 09, 2002 at 12:34:05PM -0500, James Bottomley wrote:
> What about block devices that could usefully use multi-path to achieve network
> redundancy, like nbd? If it's in the block layer or above, they can be made to
> work with minimal effort.
>
> My basic point is that the utility of the feature transcends SCSI, so SCSI is
> too low a layer for it.
I agree it has potential uses outside of SCSI, this does not directly
imply that we need to create a generic implementation. I have found no
code to reference in other block drivers or in the block layer. I've
looked some at the dasd code but can't figure out if or where there is any
multi-path code.
Putting multi-path into the block layer means it would have to acquire and
maintain a handle (i.e. path) for each device it knows about, and then
eventually pass this handle down to the lower level. I don't see this
happening in 2.5/2.6, unless someone is coding it right now.
It makes sense to at least expose the topology of the IO storage, whether
or not the block or other layers can figure out what to do with the
information. That is, ideally for SCSI we should have a representation of
the target - like struct scsi_target - and then the target is multi-pathed,
not the devices (LUNs, block or character devices) attached to the target.
We should also have a bus or fabric representation, showing multi-path from
the adapters view (multiple initiators on the fabric or bus).
Whether or not the fabric or target information is used to route IO, they
are useful for hardware removal/replacement. Imagine replacing a fibre
switch, or replacing a failed controller on a raid array.
If all this information was in the device model (driver?), with some sort
of function or data pointers, perhaps (in 2.7.x timeframe) we could route
IO and call appropriate drivers based on that information.
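As a rough sketch of what such a representation might look like (these
structures are hypothetical, not from the posted patch):

/* Hypothetical topology sketch: the target, not the LUN, owns the paths. */

struct mp_path {
        struct mp_path   *next;       /* sibling paths to the same target */
        int               host_no;    /* initiator (HBA) this path uses   */
        int               channel;
        int               target_id;
        int               failed;     /* path state                       */
};

struct mp_target {
        struct mp_path   *paths;      /* every initiator->target connection */
        int               num_paths;
        char              wwid[32];   /* unique identifier of the target  */
};

struct mp_lun {
        struct mp_target *target;     /* shared by all paths              */
        int               lun;
        /* block/character device state hangs off this, not off a path    */
};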
> > A major problem with multi-path in md or other volume manager is that
> > we use multiple (block layer) queues for a single device, when we
> > should be using a single queue. If we want to use all paths to a
> > device (i.e. round robin across paths or such, not a failover model)
> > this means the elevator code becomes inefficient, maybe even
> > counterproductive. For disk arrays, this might not be bad, but for
> > actual drives or even plugging single ported drives into a switch or
> > bus with multiple initiators, this could lead to slower disk
> > performance.
>
> That's true today, but may not be true in 2.6. Suparna's bio splitting code
> is aimed precisely at this and other software RAID cases.
Yes, but then we need some sort of md/RAID/volume manager aware elevator
code + bio splitting, and perhaps avoid calling elevator code normally called
for a Scsi_Device. Though I can imagine splitting the bio in md and then
still merging and sorting requests for SCSI.
> > In the current code, each path is allocated a Scsi_Device, including a
> > request_queue_t, and a set of Scsi_Cmnd structures. Not only do we end
> > up with a Scsi_Device for each path, we also have an upper level (sd,
> > sg, st, or sr) driver attached to each Scsi_Device.
>
> You can't really get away from this. Transfer parameters are negotiated at
> the Scsi_Device level (i.e. per device path from HBA to controller), and LLDs
> accept I/O's for Scsi_Devices. Whatever you do, you still need an entity that
> performs most of the same functions as the Scsi_Device, so you might as well
> keep Scsi_Device itself, since it works.
Yes negotiation is at the adapter level, but that does not have to be tied
to a Scsi_Device. I need to search for Scsi_Device::hostdata usage to
figure out details, and to figure out if anything is broken in the current
scsi multi-path code - right now it requires the same adapter drivers be
used and that certain Scsi_Host parameters are equal if multiple paths
to a Scsi_Device are found.
> > For sd, this means if you have n paths to each SCSI device, you are
> > limited to whatever limit sd has divided by n, right now 128 / n.
> > Having four paths to a device is very reasonable, limiting us to 32
> > devices, but with the overhead of 128 devices.
>
> I really don't expect this to be true in 2.6.
If we use a Scsi_Device for each path, we always have the overhead of the
number of paths times the number of devices - upping the limits of sd
certainly helps, but we are then increasing the possibly large amount
of memory that we can waste. And, other devices besides disks can be
multi-pathed.
> > Using a volume manager to implement multiple paths (again non-failover
> > model) means that the queue_depth might be too large if the
> > queue_depth (i.e. number of outstanding commands sent to the drive)
> is set as a per-device value - we can end up sending n * queue_depth
> > commands to a device.
>
> The queues tend to be in the controllers, not in the RAID devices, thus for a
> dual path RAID device you usually have two caching controllers and thus twice
> the queue depth (I know this isn't always the case, but it certainly is enough
> of the time for me to argue that you should have the flexibility to queue per
> path).
You can have multiple initiators on FCP or SPI, without dual controllers
involved at all. Most of my multi-path testing has been with dual
ported FCP disk drives, with multiple FCP adapters connected to a
switch, not with disk arrays (I don't have any non-failover multi-ported
disk arrays available, I'm using a FAStT200 disk array); I don't know the
details of the drive controllers for my disks, but putting multiple
controllers in a disk drive certainly would increase the cost.
Yes, per path queues and per device queues are reasonable; per path queues
require knowledge of actual device ports, which is not in the current scsi multi-path
patch. The code I have now uses the Scsi_Host::can_queue to limit the number
of commands sent to a host. I really need slave_attach() support in the host
adapter (like Doug L's patch a while back), plus maybe a slave_attach_path(),
and/or queue limit per path.
Per path queues are not required, as long as any queue limits do not
hinder the performance.
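Roughly, the kind of accounting described here might look like this toy sketch
(names and fields are made up; no locking or real queueing):

#include <stdbool.h>

/* Toy accounting only, not the patch code. */
struct host { int can_queue; int busy; };                /* per-HBA cap  */
struct path { struct host *host; int depth; int busy; }; /* per-path cap */

/* May this path accept another command right now? */
static bool path_may_queue(const struct path *p)
{
        if (p->host->busy >= p->host->can_queue)   /* host-wide limit    */
                return false;
        if (p->depth && p->busy >= p->depth)       /* optional per path  */
                return false;
        return true;
}

static void path_start(struct path *p) { p->busy++; p->host->busy++; }
static void path_done(struct path *p)  { p->busy--; p->host->busy--; }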
> SCSI got into a lot of trouble by going down the "kernel doesn't have X
> feature I need, so I'll just code it into the SCSI mid-layer instead" road; I'm
> loth to accept something into SCSI that I don't think belongs there in the
> long term.
>
> Answer me this question:
>
> - In the foreseeable future does multi-path have uses other than SCSI?
>
> I've got to say, I can't see a "no" to that one, so it fails the high level
> bar to getting into the scsi subsystem. However, the kernel, as has been said
> before, isn't a theoretical exercise in design, so is there a good expediency
> argument (like "it will take one year to get all the features of the block
> layer to arrive and I have a customer now"). Also, to go in under expediency,
> the code must be readily removable against the day it can be redone correctly.
Yes, there could be future multi-path users, or maybe with DASD. If we take
SCSI and DASD as existing usage, they could be a basis for a block layer
(or generic) set of multi-path interfaces.
There is code available for scsi multi-path; this is not a design in theory.
Anyone can take the code and fold it into a block layer implementation or
other approach. I would be willing to work on scsi usage or such for any
new block level or other such code for generic multi-path use. At this
time I wouldn't feel comfortable adding to or modifying block layer
interfaces and code, nor do I think it is possible to come up with the
best interface given only one block driver implementation, nor do I think
there is enough time to get this into 2.5.x.
IMO, there is demand for scsi multi-path support now, as users move to
large databases requiring higher availability. md or volume manager
for failover is adequate in some of these cases.
I see other issues as being more important to scsi - like cleaning it up
or rewriting portions of the code, but we still need to add new features
as we move forward.
Even with generic block layer multi-path support, we still need block
driver (scsi, ide, etc.) code for multi-path.
> > Generic device naming consistency is a problem if multiple devices
> > show up with the same id.
>
> Patrick Mochel has an open task to come up with a solution to this.
I don't think this can be solved if multiple devices show up with the same
id. If I have five disks that all say I'm disk X, how can there be one
name or handle for it from user level?
> > With the scsi layer multi-path, ide-scsi or usb-scsi could also do
> > multi-path IO.
>
> The "scsi is everything" approach got its wings shot off at the kernel summit,
> and subsequently confirmed its death in a protracted wrangle on lkml (I can't
> remember the reference off the top of my head, but I'm sure others can).
Agreed, but having the block layer be everything is also wrong.
My view is that md/volume manager multi-pathing is useful with 2.4.x, scsi
layer multi-path for 2.5.x, and this (perhaps with DASD) could then evolve
into generic block level (or perhaps integrated with the device model)
multi-pathing support for use in 2.7.x. Do you agree or disagree with this
approach?
-- Patrick Mansfield
On Sep 9, 5:08pm, Patrick Mansfield wrote:
>
> You can have multiple initiators on FCP or SPI, without dual controllers
> involved at all. Most of my multi-path testing has been with dual
> ported FCP disk drives, with multiple FCP adapters connected to a
> switch, not with disk arrays (I don't have any non-failover multi-ported
> disk arrays available, I'm using a FAStT200 disk array); I don't know the
> details of the drive controllers for my disks, but putting multiple
> controllers in a disk drive certainly would increase the cost.
Is there any plan to do something for hardware RAIDs in which two different
RAID controllers can get to the same logical unit, but you pay a performance
penalty when you access the lun via both controllers? It seems to me that
all RAIDs that don't require a command to switch a lun from one to the
other controller (i.e. where both ctlrs can access a lun simultaneously)
pay a performance penalty when you access a lun from both.
Working around this in a generic way (i.e. without designation by the
system admin) seems difficult, so I'm wondering what may have been done
with this (my reading of this discussion is that it has not been tackled
yet).
thanks
jeremy
On Mon, Sep 09, 2002 at 12:49:44PM +0200, Lars Marowsky-Bree wrote:
> Morning everyone,
>
> I hope people are waking up by now ;-)
>
> So, what is the take on "multi-path IO" (in particular, storage) in 2.5/2.6?
>
> Right now, we have md multipathing in 2.4 (+ an enhancement to that one by
> Jens Axboe and myself, which however was ignored on l-k ;-), an enhancement to
> LVM1 and various hardware-specific and thus obviously wrong approaches.
>
> I am looking at what to do for 2.5. I have considered porting the small
> changes from 2.4 to md 2.5. The LVM1 changes are probably out and gone, as
> LVM1 still doesn't work.
>
> I noticed that EVMS duplicates the entire md layer internally (great way to
> code, really!), so that might also require changing if I update the md code.
>
> Or can the LVM2 device-mapper be used to do that more cleanly?
We have a multi-path target for device-mapper planned for later this year.
This will be a multi-path addon to the generic mapping service(s)
device-mapper already provides, which can multi-path access to any
given block device, not just logical volumes.
>
> I wonder whether anyone has given this some thought already.
We did ;)
>
>
> Sincerely,
> Lars Marowsky-Brée <[email protected]>
>
> --
> Immortality is an adequate definition of high availability for me.
> --- Gregory F. Pfister
--
Regards,
Heinz -- The LVM Guy --
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Heinz Mauelshagen Sistina Software Inc.
Senior Consultant/Developer Am Sonnenhang 11
56242 Marienrachdorf
Germany
[email protected] +49 2626 141200
FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
On 2002-09-10T00:55:58,
Jeremy Higdon <[email protected]> said:
> Is there any plan to do something for hardware RAIDs in which two different
> RAID controllers can get to the same logical unit, but you pay a performance
> penalty when you access the lun via both controllers?
This is implemented in the md multipath patch in 2.4; it distinguishes between
"active" and "spare" paths.
The LVM1 patch also does this by having priorities for each path and only
going to the next priority group if all paths in the current one have failed,
which IMHO is slightly over the top but there is always someone who might need
it ;-)
This functionality is a generic requirement IMHO.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
On 2002-09-09T17:08:47,
Patrick Mansfield <[email protected]> said:
Patrick, I am only replying to what I understand. Some of your comments on the
internals of the SCSI layer are beyond me ;-)
> Yes negotiation is at the adapter level, but that does not have to be tied
> to a Scsi_Device. I need to search for Scsi_Device::hostdata usage to
> figure out details, and to figure out if anything is broken in the current
> scsi multi-path code - right now it requires the same adapter drivers be
> used and that certain Scsi_Host parameters are equal if multiple paths
> to a Scsi_Device are found.
This seems to be a serious limitation. There are good reasons for wanting to
use different HBAs for the different paths.
And the Scsi_Device might be quite different. Imagine something like two
storage boxes which do internal replication among them; yes, you'd only want
to use one of them normally (because the cache-coherency traffic is going to
kill performance otherwise), but you can fail over from one to the other even
if they have different SCSI serials etc.
> of memory that we can waste. And, other devices besides disks can be
> multi-pathed.
That is a good point.
But it would also be true for a generic block device implementation.
> Yes, there could be future multi-path users, or maybe with DASD. If we take
> SCSI and DASD as existing usage, they could be a basis for a block layer
> (or generic) set of multi-path interfaces.
SATA will also support multipathing if the birds were right, so it might make
sense to keep this in mind, at least for 2.7.
> There is code available for scsi multi-path; this is not a design in theory.
Well, there is code available for all the others too ;-)
> IMO, there is demand for scsi multi-path support now, as users move to
> large databases requiring higher availabitity. md or volume manager
> for failover is adequate in some of these cases.
The volume manager multi-pathing, at least as done via the LVM1 patch, has a
major drawback. It can't easily be stacked with software RAID. It is very
awkward to do that right now.
And software RAID on top of multi-pathing is a typical example of a truly
fault tolerant configuration.
That's obviously easier with md, and I assume your SCSI code can also do that
nicely.
> Even with generic block layer multi-path support, we still need block
> driver (scsi, ide, etc.) code for multi-path.
Yes. Error handling in particular ;-)
The topology information you mention is also a good candidate for exposure.
> Agreed, but having the block layer be everything is also wrong.
Having the block layer handle all block devices seems fairly reasonable to
me.
> My view is that md/volume manager multi-pathing is useful with 2.4.x, scsi
> layer multi-path for 2.5.x, and this (perhaps with DASD) could then evolve
> into generic block level (or perhaps integrated with the device model)
> multi-pathing support for use in 2.7.x. Do you agree or disagree with this
> approach?
Well, I guess 2.5/2.6 will have all the different multi-path implementations
mentioned so far (EVMS, LVM2, md, scsi, proprietary) - they all have code and
a userbase... All of them and future implementations can benefit from better
error handling and general cleanup, so that might be the best to do for now.
I think it is too soon to clean that up and consolidate the m-p approaches,
but I think it really ought to be consolidated in 2.7, and this seems like a
good time to start planning for that one.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
On 2002-09-09T11:40:26,
Mike Anderson <[email protected]> said:
> When you get into networking I believe we may get into path failover
> capability that is already implemented by the network stack. So the
> paths may not be visible to the block layer.
It depends. The block layer might need knowledge of the different paths for
load balancing.
> The utility does transcend SCSI, but transport / device specific
> characteristics may make "true" generic implementations difficult.
I disagree. What makes "generic" implementations difficult is the absolutely
mediocre error reporting and handling from the lower layers.
With multipathing, you want the lower level to hand you the error
_immediately_ if there is some way it could be related to a path failure and
no automatic retries should take place - so you can immediately mark the path
as faulty and go to another.
However, on a "access beyond end of device" or a clear read error, failing a
path is a rather stupid idea, but instead the error should go up immediately.
This will need to be sorted regardless of the layer it is implemented in.
How far has this been already addressed in 2.5 ?
> > > For sd, this means if you have n paths to each SCSI device, you are
> > > limited to whatever limit sd has divided by n, right now 128 / n.
> > > Having four paths to a device is very reasonable, limiting us to 32
> > > devices, but with the overhead of 128 devices.
> >
> > I really don't expect this to be true in 2.6.
> >
> While the device space may be increased in 2.6 you are still consuming
> extra resources, but we do this in other places also.
For user-space reprobing of failed paths, it may be vital to expose the
physical paths too. Then "reprobing" could be as simple as
dd if=/dev/physical/path of=/dev/null count=1 && enable_path_again
I dislike reprobing in kernel space, in particular using live requests as
the LVM1 patch by IBM does it.
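The dd probe above could equally be a tiny C program; a sketch, with the device
node and the follow-up "enable" step left as placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Probe a physical path by reading one block from it; on success the
 * caller (a script or daemon) would re-enable the path, e.g. via a
 * proc/driverfs attribute. The device node below is a placeholder. */
int main(void)
{
        const char *path = "/dev/physical/path";    /* hypothetical node */
        char buf[512];
        int fd = open(path, O_RDONLY);

        if (fd < 0 || read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                perror(path);
                return 1;                           /* leave path failed */
        }
        close(fd);
        printf("%s readable, ok to re-enable\n", path);
        return 0;
}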
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
James Bottomley wrote:
>Answer me this question:
>- In the foreseeable future does multi-path have uses other than SCSI?
We (HP) would like to use multipath i/o with the cciss driver
(which is a block driver).
We can use the md driver for this. However, we cannot boot from
such a multipath device. Lilo and grub do not understand md multipath
devices, nor do anaconda or other installers. (Enhancing all of those,
I'd like to avoid. Cramming multipath i/o into the low level driver
would accomplish that, but, too yucky.)
If there is work going on to enhance the multipath support in linux
it would be nice if you could boot from and install to such devices.
-- steve
On Tue, 2002-09-10 at 15:06, Cameron, Steve wrote:
> We can use the md driver for this. However, we cannot boot from
> such a multipath device. Lilo and grub do not understand md multipath
> devices, nor do anaconda or other installers. (Enhancing all of those,
> I'd like to avoid. Cramming multipath i/o into the low level driver
> would accomplish that, but, too yucky.)
>
> If there is work going on to enhance the multipath support in linux
> it would be nice if you could boot from and install to such devices.
Booting from them is a BIOS matter. If the BIOS can do the block loads
off the multipath device we can do the rest. The probe stuff can be done
in the boot initrd - as some vendors handle raid for example.
Alan Cox wrote:
> On Tue, 2002-09-10 at 15:06, Cameron, Steve wrote:
> > We can use the md driver for this. However, we cannot boot from
> > such a multipath device. Lilo and grub do not understand md multipath
> > devices, nor do anaconda or other installers. (Enhancing all of those,
> > I'd like to avoid. Cramming multipath i/o into the low level driver
> > would accomplish that, but, too yucky.)
> >
> > If there is work going on to enhance the multipath support in linux
> > it would be nice if you could boot from and install to such devices.
>
> Booting from them is a BIOS matter. If the BIOS can do the block loads
> off the multipath device we can do the rest. The probe stuff
> can be done
> in the boot initrd - as some vendors handle raid for example.
Well, the BIOS can do it if it has one working path, right?
(I think the md information is at the end of the partition,)
Maybe it will have some trouble if the primary path is failed,
but ignoring that for now. In the normal case, the bios shouldn't
even have to know it's a multipath device, right?
lilo and grub as they stand today, and anaconda (et al) as it
stands today, cannot do it. They can do RAID1 md devices only.
lilo for example will complain if you try to run it with
boot=/dev/md0, and /dev/md0 is not a RAID1 device. At least
it did when I tried it. When I looked at the lilo source, it
goes through each disk in the raid device and puts the boot
code on each one, a la RAID1. No provision is made for any other
kind of raid. Raid 5, naturally is much harder, and is unlikely
to have BIOS support, so this is understandable. Multipath seems
much closer to raid1 in terms of what's necessary for booting,
that is, much much easier than raid 5. I tried hacking on lilo a
bit, but so far, I am unsuccessful. I think grub cannot even do
RAID1.
I agree in principle, the initrd scripts can insmod multipath.o
to get things rolling, etc. The trouble comes from lilo, grub
and install time configuration.
-- steve
On Tue, 2002-09-10 at 15:43, Cameron, Steve wrote:
> Well, the BIOS can do it if it has one working path, right?
> (I think the md information is at the end of the partition,)
Yes. A good PC bios will spot an hda boot fail, and try hdc. Good PC
bioses are alas slightly hard to find nowadays. In that setup raid1
works very well. Multipath is obviously a lot more complicated.
> lilo and grub as they stand today, and anaconda (et al) as it
> stands today, cannot do it. They can do RAID1 md devices only.
> lilo for example will complain if you try to run it with
> boot=/dev/md0, and /dev/md0 is not a RAID1 device. At least
It relies on the BIOS to do the data loading off the md0. In your
scenario you would tell it the boot is on /dev/sdfoo where that is the
primary path. I guess the ugly approach would be to add lilo/grub
entries for booting off either path as two separate kernel entries.
> bit, but so far, I am unsuccessful. I think grub cannot even do
> RAID1.
Works for me
> I agree in principle, the initrd scripts can insmod multipath.o
> to get things rolling, etc. The trouble comes from lilo, grub
> and install time configuration.
Yes.
On Tue, Sep 10, 2002 at 03:02:01PM +0200, Lars Marowsky-Bree wrote:
> On 2002-09-09T11:40:26,
> With multipathing, you want the lower level to hand you the error
> _immediately_ if there is some way it could be related to a path failure and
> no automatic retries should take place - so you can immediately mark the path
> as faulty and go to another.
>
> However, on a "access beyond end of device" or a clear read error, failing a
> path is a rather stupid idea, but instead the error should go up immediately.
>
> This will need to be sorted regardless of the layer it is implemented in.
>
> How far has this been already addressed in 2.5 ?
>
The current scsi multi-path code handles the above cases. There is
a scsi_path_decide_disposition() that fails paths independent of the
result of the IO. It is similar to the current scsi_decide_disposition,
except it also fails the path. So for a check condition of media error,
the IO might be marked as SUCCESS (meaning completed with an error),
but the path would not be modified (there are more details than this).
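As a rough illustration of that split between what happens to the IO and what
happens to the path (hypothetical names, not the actual
scsi_path_decide_disposition() code):

enum io_disposition   { IO_SUCCESS, IO_RETRY };
enum path_disposition { PATH_KEEP, PATH_FAIL };

struct completion_info {
        int transport_error;    /* selection timeout, link down, ...     */
        int media_error;        /* sense data says a block was bad       */
};

/* Only transport-level failures count against the path... */
static enum path_disposition classify_path(const struct completion_info *c)
{
        return c->transport_error ? PATH_FAIL : PATH_KEEP;
}

/* ...while the IO itself is retried only when another path might help. */
static enum io_disposition classify_io(const struct completion_info *c)
{
        if (c->transport_error)
                return IO_RETRY;     /* resend, preferably down another path */
        return IO_SUCCESS;           /* "completed", possibly with an error
                                        (e.g. media error) that no other
                                        path can fix                         */
}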
> For user-space reprobing of failed paths, it may be vital to expose the
> physical paths too. Then "reprobing" could be as simple as
>
> dd if=/dev/physical/path of=/dev/null count=1 && enable_path_again
>
Yes, that is a good idea; I was thinking that this should be done
via sg, modifying sg to allow path selection - so no matter what, sg
could be used to probe a path. I have no plans to expose a user level
device for each path, but a device model "file" could be exposed for
the state of each path, and its state controlled via driverfs.
> I dislike reprobing in kernel space, in particular using live requests as
> the LVM1 patch by IBM does it.
>
>
> Sincerely,
> Lars Marowsky-Brée <[email protected]>
-- Patrick Mansfield
On Tue, Sep 10, 2002 at 03:04:27PM +0200, Lars Marowsky-Bree wrote:
> On 2002-09-10T00:55:58,
> Jeremy Higdon <[email protected]> said:
>
> > Is there any plan to do something for hardware RAIDs in which two different
> > RAID controllers can get to the same logical unit, but you pay a performance
> > penalty when you access the lun via both controllers?
>
> This is implemented in the md multipath patch in 2.4; it distinguishes between
> "active" and "spare" paths.
>
> The LVM1 patch also does this by having priorities for each path and only
> going to the next priority group if all paths in the current one have failed,
> which IMHO is slightly over the top but there is always someone who might need
> it ;-)
>
> This functionality is a generic requirement IMHO.
>
>
> Sincerely,
> Lars Marowsky-Brée <[email protected]>
The current scsi multi-path has a default setting of "last path used", so
that it will always work with such controllers (I refer to such controllers
as fail-over devices). You have to modify the config, or boot with a
flag to get round-robin path selection. Right now, all paths will start
out on the same adapter (initiator); this is bad if you have more than
a few arrays attached to an adapter.
I was planning on implementing something similar to what you describe
for LVM1, with path weighting. Yes, it seems a bit heavy, but if there
are only two weights or priorities, it looks just like your active/spare
code, and is a bit more flexible.
Figuring out what ports are active or spare is not easy, and varies
from array to array. This is another good candidate for user level
probing/configuration. I will probably hard-code device
personality information into the kernel for now, and hopefully in
2.7 we could move SCSI to user level probe/scan/configure.
I have heard that some arrays at one time had a small penalty for
switching controllers, and it was best to use the last path used,
but it was still OK to use both at the same time (cache warmth). I
was trying to think of a way to allow good load balancing in such
cases, but didn't come up with a good solution. Like use last path
used selection, and once a path is too "busy" move to another path
- but then all devices might switch at the same time; some heuristics
or timing could probably be added to avoid such problems. The code
allows for path selection in a single function, so it should not
be difficult to add more complex path selection.
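For illustration, such a selection function might look like the following toy
sketch (hypothetical, not the code in the patch): two distinct weights give
active/spare, more give priority groups, and selection within the best group
is round-robin.

/* Hypothetical selector; weights are assumed >= 0. */
struct mpath {
        int weight;     /* lower is preferred */
        int failed;     /* non-zero once the path has been marked bad */
};

/* Round-robin over the best (lowest-weight) group of live paths;
 * returns an index into p[], or -1 if every path has failed.
 * *last holds the previously used index, initialised to 0. */
static int select_path(const struct mpath *p, int npaths, int *last)
{
        int best = -1, i, idx;

        for (i = 0; i < npaths; i++)
                if (!p[i].failed && (best < 0 || p[i].weight < best))
                        best = p[i].weight;
        if (best < 0)
                return -1;                      /* every path has failed */

        for (i = 1; i <= npaths; i++) {
                idx = (*last + i) % npaths;
                if (!p[idx].failed && p[idx].weight == best)
                        return *last = idx;
        }
        return -1;                              /* not reached */
}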
I have also put NUMA path selection into the latest code. I've tested
it with 2.5.32, and a bunch of NUMA + NUMAQ patches on IBM/Sequent
NUMAQ systems.
-- Patrick Mansfield
Lars Marowsky-Bree [[email protected]] wrote:
> On 2002-09-09T11:40:26,
> Mike Anderson <[email protected]> said:
>
> > When you get into networking I believe we may get into path failover
> > capability that is already implemented by the network stack. So the
> > paths may not be visible to the block layer.
>
> It depends. The block layer might need knowledge of the different paths for
> load balancing.
>
> > The utility does transcend SCSI, but transport / device specific
> > characteristics may make "true" generic implementations difficult.
>
> I disagree. What makes "generic" implementations difficult is the absolutely
> mediocre error reporting and handling from the lower layers.
>
We're working on it. I have done a mid scsi_error cleanup patch for 2.5
so that we can better view where we are at currently. Hopefully soon we
can do some actual useful improvements.
My main point on the previous comment though was that some transports may
decide not to expose their paths (i.e. they may manage them at the
transport layer) and the block layer would be unable to attach to
individual paths.
The second point I was trying to make is that if you look at most
multi-path solutions across different operating systems, once they have
failover capability and move to support more performant / advanced
multi-path solutions, they need specific attributes of the device or
path. These attributes have sometimes been called personalities.
Examples of these personalities are path usage models (failover,
transparent failover, active load balancing), ports per controller
config, platform memory latency (NUMA), cache characteristics, special
error decoding for device events, etc.
I mention a few of these in this document.
http://www-124.ibm.com/storageio/multipath/scsi-multipath/docs/index.html
These personalities could be acquired at any level of the IO stack, but
involve some API if we want to try and get as close as possible to
"generic".
> With multipathing, you want the lower level to hand you the error
> _immediately_ if there is some way it could be related to a path failure and
> no automatic retries should take place - so you can immediately mark the path
> as faulty and go to another.
>
Agreed. In a past O/S we worked hard to have our error policy structured
so that transports worried about transport errors and devices worried
about device errors. If a transport received a completion of an IO, its
job was done (we did have some edge cases and heuristics to stop paths
from cycling from disabled to enabled).
> However, on a "access beyond end of device" or a clear read error, failing a
> path is a rather stupid idea, but instead the error should go up immediately.
>
Agreed, each layer should only deal with errors in their own domain.
> This will need to be sorted regardless of the layer it is implemented in.
>
> How far has this been already addressed in 2.5 ?
See previous comment.
>
> > > > For sd, this means if you have n paths to each SCSI device, you are
> > > > limited to whatever limit sd has divided by n, right now 128 / n.
> > > > Having four paths to a device is very reasonable, limiting us to 32
> > > > devices, but with the overhead of 128 devices.
> > >
> > > I really don't expect this to be true in 2.6.
> > >
> > While the device space may be increased in 2.6 you are still consuming
> > extra resources, but we do this in other places also.
>
> For user-space reprobing of failed paths, it may be vital to expose the
> physical paths too. Then "reprobing" could be as simple as
>
> dd if=/dev/physical/path of=/dev/null count=1 && enable_path_again
>
> I dislike reprobing in kernel space, in particular using live requests as
> the LVM1 patch by IBM does it.
>
In the current mid-mp patch the paths are exposed through the proc
interface and can be activated or have their state changed through an echo
command. While live requests are sometimes viewed as a bad way to activate a
path, very small IO sizes on optical networks are unreliable for
determining anything but completely dead paths. The size of the payload
is important.
I believe the difference in views here is what to expose and what
size/type of structure represents each piece of the nexus (b:c:t:l).
-Mike
--
Michael Anderson
[email protected]
Alan Cox wrote:
> On Tue, 2002-09-10 at 15:43, Cameron, Steve wrote:
> > Well, the BIOS can do it if it has one working path, right?
> > (I think the md information is at the end of the partition,)
>
> Yes. A good PC bios will spot an hda boot fail, and try hdc. Good PC
> bioses are alas slightly hard to find nowadays. In that setup raid1
> works very well. Multipath is obviously a lot more complicated.
For the failed-primary-path case, yes. In the case where the primary path is
working, I would think booting from a multipath device and a RAID1
device would be very very similar, or even identical, from the
perspective of the BIOS, right?
> > lilo and grub as they stand today, and anaconda (et al) as it
> > stands today, cannot do it. They can do RAID1 md devices only.
> > lilo for example will complain if you try to run it with
> > boot=/dev/md0, and /dev/md0 is not a RAID1 device. At least
>
> It relies on the BIOS to do the data loading off the md0. In your
> scenario you would tell it the boot is on /dev/sdfoo where that is the
> primary path. I guess the ugly approach would be to add lilo/grub
> entries for booting off either path as two separate kernel entries.
>
Hmm, I thought I had tried this, but, I had tried so many things.
If anyone has successfully set up a system booting from a multipath
md device, I'd like to hear about it.
What I tried was more or less this:
1. Install normally to disk-A, remove disk-A from the system.
2. Install normally to disk-B. Install disk-A into the system.
and boot from disk-B. (now I can safely copy from disk-A, which has
no actively mounted partitions.)
3. Create multipath devices on disk-C, one for each partition.
The partitions are bigger than those on disk-A, to allow for md to
put its data at the end.
4. Copy, (using dd) the partitions from disk-A to the md devices.
5. mount the md devices, and chroot to the md-root device.
Try to figure out how to run lilo to make booting from disk-C possible
also, initrd modifications to insmod multipath.o, mount the md devices,
etc. This part, (making lilo work) I never was able to figure out.
Would boot up and say "LI" then stop... or various other problems
that I can't quite recall now. (this was a couple weeks ago)
I guess this is getting a little off topic for linux-kernel, so maybe I
should let this drop now, or take it over to linux-raid, but I _would_
like to hear from anyone who has got a multipath boot device working.
Anyway, even if this can be made to work, this leaves a rather ugly and
convoluted method of installation.
> > bit, but so far, I am unsuccessful. I think grub cannot even do
> > RAID1.
>
> Works for me
Ok, good to hear. (I had it only on hearsay that grub couldn't do it,
so my experiments were confined to lilo.)
-- steve
> > > Generic device naming consistency is a problem if multiple devices
> > > show up with the same id.
> >
> > Patrick Mochel has an open task to come up with a solution to this.
>
> I don't think this can be solved if multiple devices show up with the same
> id. If I have five disks that all say I'm disk X, how can there be one
> name or handle for it from user level?
Easy: you map the unique identifier of the device to a name in userspace.
In our utopian future, /sbin/hotplug is called with that unique ID as one
of its parameters. It searches for, and finds, names based on what the ID is. If
the name(s) already exist, then it doesn't continue.
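Something like this toy userspace sketch (the /dev/byid directory and the ID
format are made up):

#include <stdio.h>
#include <unistd.h>

/* Usage: mkalias <unique-id> <kernel-device-node>
 * Hypothetical /sbin/hotplug helper: map a device's unique ID to a stable
 * name, and do nothing if the name already exists (e.g. a second path). */
int main(int argc, char **argv)
{
        char alias[256];

        if (argc != 3)
                return 1;
        snprintf(alias, sizeof(alias), "/dev/byid/%s", argv[1]);

        if (access(alias, F_OK) == 0) {
                printf("%s already present, ignoring extra path\n", alias);
                return 0;
        }
        if (symlink(argv[2], alias) != 0) {
                perror("symlink");
                return 1;
        }
        return 0;
}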
-pat
On Tue, Sep 10, 2002 at 10:21:53AM -0700, Patrick Mochel wrote:
>
> > > > Generic device naming consistency is a problem if multiple devices
> > > > show up with the same id.
> > >
> > > Patrick Mochel has an open task to come up with a solution to this.
> >
> > I don't think this can be solved if multiple devices show up with the same
> > id. If I have five disks that all say I'm disk X, how can there be one
> > name or handle for it from user level?
>
> Easy: you map the unique identifier of the device to a name in userspace.
> In our utopian future, /sbin/hotplug is called with that unique ID as one
> of its parameters. It searches for, and finds, names based on what the ID is. If
> the name(s) already exist, then it doesn't continue.
>
>
> -pat
But then if the md or volume manager wants to do multi-path IO it
will not be able to find all of the names in userspace since the
extra ones (second path and on) have been dropped.
-- Patrick Mansfield
On Tue, 2002-09-10 at 17:34, Cameron, Steve wrote:
> 1. Install normally to disk-A, remove disk-A from the system.
> 2. Install normally to disk-B. Install disk-A into the system.
> and boot from disk-B. (now I can safely copy from disk-A, which has
> no actively mounted partitions.)
> 3. Create multipath devices on disk-C, one for each partition.
> The partitions are bigger than those on disk-A, to allow for md to
> put its data at the end.
> 4. Copy, (using dd) the partitions from disk-A to the md devices.
>
> 5. mount the md devices, and chroot to the md-root device.
> Try to figure out how to run lilo to make booting from disk-C possible
> also, initrd modifications to insmod multipath.o, mount the md devices,
> etc. This part, (making lilo work) I never was able to figure out.
>
> Would boot up and say "LI" then stop... or various other problems
> that I can't quite recall now. (this was a couple weeks ago)
Sounds like geometry mismatches. You need to be sure that the BIOS disk
mappings are constant and that the geometries match. The former can be a
big problem. If your failed drive falls off the scsi bus scan then your
BIOS disk 0x80 becomes the second disk not the first and life starts to
go downhill from that moment.
On Tue, 10 Sep 2002, Patrick Mansfield wrote:
> On Tue, Sep 10, 2002 at 10:21:53AM -0700, Patrick Mochel wrote:
> >
> > > > > Generic device naming consistency is a problem if multiple devices
> > > > > show up with the same id.
> > > >
> > > > Patrick Mochel has an open task to come up with a solution to this.
> > >
> > > I don't think this can be solved if multiple devices show up with the same
> > > id. If I have five disks that all say I'm disk X, how can there be one
> > > name or handle for it from user level?
> >
> > Easy: you map the unique identifier of the device to a name in userspace.
> > In our utopian future, /sbin/hotplug is called with that unique ID as one
> > of its parameters. It searches for, and finds, names based on what the ID is. If
> > the name(s) already exist, then it doesn't continue.
> >
> >
> > -pat
>
> But then if the md or volume manager wants to do multi-path IO it
> will not be able to find all of the names in userspace since the
> extra ones (second path and on) have been dropped.
Which is it that you want? One canonical name or all the paths? I supplied
a solution for the former in my response. The latter is solved via the
exposure of the paths in driverfs, which has been discussed previously.
-pat
On Tue, Sep 10, 2002 at 03:16:06PM +0200, Lars Marowsky-Bree wrote:
> On 2002-09-09T17:08:47,
> Patrick Mansfield <[email protected]> said:
> > Yes negotiation is at the adapter level, but that does not have to be tied
> > to a Scsi_Device. I need to search for Scsi_Device::hostdata usage to
> > figure out details, and to figure out if anything is broken in the current
> > scsi multi-path code - right now it requires the same adapter drivers be
> > used and that certain Scsi_Host parameters are equal if multiple paths
> > to a Scsi_Device are found.
>
> This seems to be a serious limitation. There are good reasons for wanting to
> use different HBAs for the different paths.
What reasons? Adapter upgrades/replacement on a live system? I can imagine
someone using different HBAs so that they won't hit the same bug in both
HBAs, but that is a weak argument; I would think such systems would want
some type of cluster failover.
If the HBAs had the same memory and other limitations, it should function
OK, but it is hard to figure out exactly what might happen (if the HBAs had
different error handling characteristics, handled timeouts differently,
etc.). It would be easy to get rid of the checking for the same drivers
(the code actually checks for the same drivers via Scsi_Host::hostt, not
the same hardware) - so it would allow multiple paths if the same driver
is used for different HBAs.
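Roughly what that check amounts to, as a hypothetical sketch (the structures
are stand-ins, not the real Scsi_Host definitions):

/* Hypothetical "same driver, comparable host parameters" check. */
struct host_template { const char *name; };

struct host {
        const struct host_template *hostt;   /* which driver owns this HBA */
        int sg_tablesize;                    /* scatter-gather limit       */
        int max_sectors;                     /* largest single transfer    */
};

/* Two paths may be collapsed onto one device only if the same driver is
 * used and the hosts impose the same transfer limits; relaxing this to
 * "different HBAs, same driver" keeps just the hostt comparison. */
static int paths_compatible(const struct host *a, const struct host *b)
{
        return a->hostt == b->hostt &&
               a->sg_tablesize == b->sg_tablesize &&
               a->max_sectors == b->max_sectors;
}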
> And the Scsi_Device might be quite different. Imagine something like two
> storage boxes which do internal replication among them; yes, you'd only want
> to use one of them normal (because the Cache-coherency traffic is going to
> kill performance otherwise), but you can failover from one to the other even
> if they have different SCSI serials etc.
> And software RAID on top of multi-pathing is a typical example of a truly
> fault tolerant configuration.
>
> That's obviously easier with md, and I assume your SCSI code can also do that
> nicely.
I haven't tried it, but I see no reason why it would not work.
> > Agreed, but having the block layer be everything is also wrong.
>
> Having the block device layer handle all block devices seems fairly reasonable
> to me.
Note that scsi uses the block device layer (the request_queue_t) for
character devices - look at st.c, sg.c, and sr*.c: calls to scsi_do_req()
or scsi_wait_req() queue to the request_queue_t. Weird, but it works - you can
open a CD via sr and sg at the same time.
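For example, a command pushed down that path looks roughly like this -
allocate a Scsi_Request and let scsi_wait_req() queue it on the device's
request_queue_t until it completes (a sketch only, buffers and error
handling trimmed):

/* Sketch: TEST UNIT READY through the block-layer queue, the same way
 * st/sg/sr issue their commands. */
#include <linux/errno.h>
#include "scsi.h"

static int probe_unit_ready(Scsi_Device *sdev)
{
        unsigned char cmd[6] = { TEST_UNIT_READY, 0, 0, 0, 0, 0 };
        Scsi_Request *sreq;
        int result;

        sreq = scsi_allocate_request(sdev);
        if (!sreq)
                return -ENOMEM;

        /* Queued on sdev's request_queue_t; sleeps until completion. */
        scsi_wait_req(sreq, cmd, NULL, 0, 30 * HZ, 3);

        result = sreq->sr_result;
        scsi_release_request(sreq);
        return result;          /* 0 means the device answered GOOD */
}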
> > My view is that md/volume manager multi-pathing is useful with 2.4.x, scsi
> > layer multi-path for 2.5.x, and this (perhaps with DASD) could then evolve
> > into generic block level (or perhaps integrated with the device model)
> > multi-pathing support for use in 2.7.x. Do you agree or disagree with this
> > approach?
>
> Well, I guess 2.5/2.6 will have all the different multi-path implementations
> mentioned so far (EVMS, LVM2, md, scsi, proprietary) - they all have code and
> a userbase... All of them and future implementations can benefit from better
> error handling and general cleanup, so that might be the best thing to do for now.
>
> I think it is too soon to clean that up and consolidate the m-p approaches,
> but I think it really ought to be consolidated in 2.7, and this seems like a
> good time to start planning for that one.
>
>
> Sincerely,
> Lars Marowsky-Brée <[email protected]>
The scsi multi-path code is not in 2.5.x, and I doubt it will be accepted
without the support of James and others.
-- Patrick Mansfield
On Tue, Sep 10, 2002 at 12:00:47PM -0700, Patrick Mochel wrote:
>
> On Tue, 10 Sep 2002, Patrick Mansfield wrote:
>
> > On Tue, Sep 10, 2002 at 10:21:53AM -0700, Patrick Mochel wrote:
> > > Easy: you map the unique identifier of the device to a name in userspace.
> > > In our utopian future, /sbin/hotplug is called with that unique ID as one
> > > of its parameters. It searches for, and creates, names based on that ID.
> > > If the name(s) already exist, then it doesn't continue.
> > >
> > >
> > > -pat
> >
> > But then if the md or volume manager wants to do multi-path IO it
> > will not be able to find all of the names in userspace since the
> > extra ones (second path and on) have been dropped.
>
> Which is it that you want? One canonical name or all the paths? I supplied
> a solution for the former in my response. The latter is solved via the
> exposure of the paths in driverfs, which has been discussed previously.
>
>
> -pat
For scsi multi-path, one name; without scsi multi-path (or for individual
paths that are not exposed in driverfs) each path probably needs to show up
in user space with a different name so md or other volume managers can use
them.
-- Patrick Mansfield
On Monday September 9, [email protected] wrote:
>
> Well, neither of the people most involved in the development (that's Neil
> Brown for md in general and Ingo Molnar for the multi-path enhancements) made
> any comments---see if you can elicit some feedback from either of them.
I'm fairly uninterested in multipath. I try not to break it while
tidying up the generic md code, but apart from that I leave it alone.
For failover, I suspect that md is an appropriate place for multipath,
though it would be nice to get more detailed error information from the
lower levels.
For load balancing you really need something lower down, just below
the elevator would seem right: at the request_fn level rather than
make_request_fn.
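To make the distinction concrete, a failover remapper at make_request_fn
level is no more than this (a sketch in 2.5 bio terms; the path-selection
structure is made up and the function would be registered with
blk_queue_make_request()):

/* Sketch: make_request_fn-level multipath failover. */
#include <linux/bio.h>
#include <linux/blkdev.h>

struct mp_path {
        struct block_device     *bdev;
        int                     failed;
};

struct mp_dev {
        struct mp_path          path[2];
        int                     current_path;
};

static struct block_device *mp_choose_path(struct mp_dev *mp)
{
        /* Failover policy only: stick with the current path until it fails. */
        if (mp->path[mp->current_path].failed)
                mp->current_path ^= 1;
        return mp->path[mp->current_path].bdev;
}

static int mp_make_request(request_queue_t *q, struct bio *bio)
{
        struct mp_dev *mp = q->queuedata;

        /* Redirect the bio to the chosen lower device and resubmit it;
         * no request_fn, no elevator at this level. */
        bio->bi_bdev = mp_choose_path(mp);
        generic_make_request(bio);
        return 0;
}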
But all that has pretty much been said.
NeilBrown
[email protected] said:
> The scsi multi-path code is not in 2.5.x, and I doubt it will be
> accepted without the support of James and others.
I haven't said "no" yet (and Doug and Jens haven't said anything). I did say
when the patches first surfaced that I didn't like the idea of replacing
Scsi_Device with Scsi_Path at the bottom and the concomitant changes to all
the Low Level Drivers which want to support multi-pathing. If this is to go
in the SCSI subsystem it has to be self contained, transparent and easily
isolated. That means the LLDs shouldn't have to be multipath aware.
I think we all agree:
1) that multi-path in SCSI isn't the way to go in the long term because other
devices may have a use for the infrastructure.
2) that the scsi-error handler is the big problem
3) that errors (both medium and transport) may need to be propagated
immediately up the block layer in order for multi-path to be handled
efficiently.
Although I outlined my ideas for a rework of the error handler, they got lost
in the noise of the abort vs reset debate. These are some of the salient
features that will help in this case:
- no retries from the tasklet.
- Quiesce from above, not below (commands return while eh processes, so we
begin with the first error and don't have to wait for all commands to return
or error out)
- It's the object of the error handler to return all commands to the block
layer for requeue and reordering as quickly as possible. They have to be
returned with an indication from the error handler that it would like them
retried. This indication can be propagated up (although I haven't given
thought to how to do that). Any commands that are sent down to probe the device
are generated from within the error handler thread (no device probing with
live commands).
What other features do you need on the eh wishlist?
James
On 2002-09-11T09:20:38,
James Bottomley <[email protected]> said:
> [email protected] said:
> > The scsi multi-path code is not in 2.5.x, and I doubt it will be
> > accepted without the support of James and others.
> I haven't said "no" yet (and Doug and Jens haven't said anything).
Except for dasds, all devices I care for with regard to multipathing _are_
SCSI, so that would solve at least 90% of my worries in the mid-term. And it
would also do multipathing for !block devices, in theory.
It does have the advantage of knowing the topology better than the block
layer; the chance that a path failure only affects one of the LUNs on a device
is pretty much nil, so it would speed up error recovery.
However, I could also live with this being handled in the volume manager /
device mapper. This would transcend all potential devices - if the character
devices were really mapped through the block layer (didn't someone have this
weird idea once? ;-), it would work for !block devs too.
Exposing better error reporting upwards is also definitely a good idea, and
if the device mapper could have the notion of "to what topology group does
this device belong to", or even "distance metric (without going into further
detail on what this is, as long as it is consistent to the physical layer) to
the current CPU" (so that the shortest path in NUMA could be selected), that
would be kinda cool ;-) And doesn't seem too intrusive.
Now, what I definitely dislike is the vast amount of duplicated code. I'm not
sure whether we can get rid of this in the 2.5 timeframe, though.
<rant>
If EVMS were cleaned up some, maybe used the neat LVM2 device mapper
internally, and were in fact a superset of everything we've had before - able
to support everything from our past as well as give us what we really want
(multi-pathing, journaled RAID etc.) - and we did the above, I'd vote for a
legacy-free kernel. Unify the damn block device management and throw out the
old code.
I hate cruft. Customers want to and will use it. Someone has to support it. It
breaks stuff. It cuts a 9 off my availability figure. ;-)
</rant>
> I think we all agree:
I agree here.
> 3) that errors (both medium and transport) may need to be propagated
> immediately up the block layer in order for multi-path to be handled
> efficiently.
Right. All the points you outline about the error handling are perfectly
valid.
But one of the issues with the md layer right now for example is the fact that
an error on a backup path will only be noticed when it actually tries to
use it, even if the cpqfc (to name the culprit, worst code I've seen in a
while) notices a link-down event. It isn't communicated anywhere, and how
should it be?
This may be for later, but is something to keep in mind.
> thought how to do that). Any commands that are sent down to probe the device
> are generated from within the error handler thread (no device probing with
> live commands).
If the path was somehow exposed to user-space (as it is with md right now),
the reprobing could even be done outside the kernel. This seems to make sense,
because a potentially extensive device-specific diagnostic doesn't have to be
folded into it then.
> What other features do you need on the eh wishlist?
The other features on my list - prioritizing paths, a useful, consistent user
interface via driverfs/devfs/procfs - are more or less policy I guess.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
[email protected] said:
> and if the device mapper could have the notion of "to what topology
> group does this device belong to", or even "distance metric (without
> going into further detail on what this is, as long as it is consistent
> to the physical layer) to the current CPU" (so that the shortest path
> in NUMA could be selected), that would be kinda cool ;-) And doesn't
> seem too intrusive.
I think I see driverfs as the solution here. Topology is deduced by examining
certain device and HBA parameters. As long as these parameters can be exposed
as nodes in the device directory for driverfs, a user level daemon can map the
topology and connect the paths at the top. It should even be possible to
weight the constructed multi-paths.
This solution appeals because the kernel doesn't have to dictate policy; all
it needs to be told is what information it should be exposing, and it lets user
level get on with policy determination (this is a mini version of why we
shouldn't have network routing policy deduced and implemented by the kernel).
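To illustrate, the daemon's core loop could be as dumb as walking the
driverfs device directory and grouping entries that report the same unique
id; the mount point and the name of the id attribute below are assumptions,
since that layout is still settling:

/*
 * Sketch: group driverfs device entries by a unique-id attribute.
 * "/devices" as the driverfs mount point and the "id" attribute file are
 * illustrative assumptions only.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int read_id(const char *dev, char *id, int len)
{
        char path[512];
        FILE *f;

        snprintf(path, sizeof(path), "/devices/%s/id", dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(id, len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        id[strcspn(id, "\n")] = '\0';
        return 0;
}

int main(void)
{
        DIR *d = opendir("/devices");
        struct dirent *e;
        char id[128];

        if (!d)
                return 1;
        while ((e = readdir(d)) != NULL) {
                if (e->d_name[0] == '.')
                        continue;
                /* Entries printing the same id are paths to one device; a
                 * real daemon would hand those groups to md/LVM2 and weight
                 * them using the exposed HBA parameters. */
                if (read_id(e->d_name, id, sizeof(id)) == 0)
                        printf("%s %s\n", id, e->d_name);
        }
        closedir(d);
        return 0;
}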
> But one of the issues with the md layer right now for example is the fact that
> an error on a backup path will only be noticed when it actually tries to
> use it, even if the cpqfc (to name the culprit, worst code I've seen in a
> while) notices a link-down event. It isn't communicated anywhere, and how
> should it be?
I've been thinking about this separately. FC in particular needs some type of
event notification API (something like "I've just seen this disk" or "my loop
just went down"). I'd like to leverage a mid-layer api into hot plug for some
of this, but I don't have the details worked out.
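The plumbing for that could be as small as a notifier chain that the
transports post events to. Only the notifier primitives below are the
existing kernel API; the event codes and the reporting hook are
hypothetical:

/* Sketch: a path/transport event channel built on notifier chains. */
#include <linux/notifier.h>

#define MPATH_EVENT_LINK_DOWN   1       /* "my loop just went down"   */
#define MPATH_EVENT_DEVICE_SEEN 2       /* "I've just seen this disk" */

static struct notifier_block *mpath_event_list;

int mpath_register_listener(struct notifier_block *nb)
{
        return notifier_chain_register(&mpath_event_list, nb);
}

int mpath_unregister_listener(struct notifier_block *nb)
{
        return notifier_chain_unregister(&mpath_event_list, nb);
}

/* Called by a transport (FC LLD, dasd, nbd, ...) when it notices a
 * topology change; 'dev' identifies the affected path. */
void mpath_report_event(unsigned long event, void *dev)
{
        notifier_call_chain(&mpath_event_list, event, dev);
}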
> If the path was somehow exposed to user-space (as it is with md right now),
> the reprobing could even be done outside the kernel. This seems to make sense,
> because a potentially extensive device-specific diagnostic doesn't have to be
> folded into it then.
The probing issue is an interesting one. At least SCSI has the ability to
probe with no IO (something like a TEST UNIT READY) and I assume other block
devices have something similar. Would it make sense to tie this to a single
well known ioctl so that you can probe any device that supports it without
having to send real I/O?
> The other features on my list - prioritizing paths, a useful, consistent user
> interface via driverfs/devfs/procfs - are more or less policy I guess.
Mike Sullivan (of IBM) is working with Patrick Mochel on this.
James
On 2002-09-11T14:37:40,
James Bottomley <[email protected]> said:
> I think I see driverfs as the solution here. Topology is deduced by
> examining certain device and HBA parameters. As long as these parameters
> can be exposed as nodes in the device directory for driverfs, a user level
> daemon can map the topology and connect the paths at the top. It should even be
> possible to weight the constructed multi-paths.
Perfect, I agree, should've thought of it. As long as this is simple enough
that it can be done in initrd (if / is on a multipath device...).
The required weighting has already been implemented in the LVM1 patch by IBM.
While it appeared overkill to me for "simple" cases, I think it is suited to
expressing proximity.
> This solution appeals because the kernel doesn't have to dictate policy,
Right.
> I've been thinking about this separately. FC in particular needs some type of
> event notification API (something like "I've just seen this disk" or "my
> loop just went down"). I'd like to leverage a mid-layer api into hot plug
> for some of this, but I don't have the details worked out.
This isn't just FC, but also dasd on S/390. Potentially also network block
devices, which can notice a link down.
> The probing issue is an interesting one. At least SCSI has the ability to
> probe with no IO (something like a TEST UNIT READY) and I assume other block
> devices have something similar. Would it make sense to tie this to a single
> well known ioctl so that you can probe any device that supports it without
> having to send real I/O?
Not sufficient. The test is policy, so the above applies here too ;-)
In the case of talking to a dual headed RAID box for example, TEST UNIT READY
might return OK, but the controller might refuse actual IO, or the path may be
somehow damaged in a way which is only detected by doing some "large" IO. Now,
this might be total overkill for other scenarios.
I vote for exposing the paths via driverfs (which, I think, is already
consensus, so the multipath group, topology etc. can be used) and allowing
user-space to reenable them after doing whatever probing is deemed necessary.
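As an example of "whatever probing is deemed necessary": a user-space daemon
can already drive its own SCSI probes through the sg driver's SG_IO ioctl -
a TEST UNIT READY below, but it could just as well push real READ traffic at
the dual-headed box first. The device name and the policy are of course made
up:

/* Sketch: user-space path probe through the sg driver (SG_IO ioctl). */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        unsigned char cdb[6] = { 0x00, 0, 0, 0, 0, 0 }; /* TEST UNIT READY */
        unsigned char sense[32];
        struct sg_io_hdr io;
        int fd, ok;

        if (argc < 2) {
                fprintf(stderr, "usage: %s /dev/sgN\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.mx_sb_len = sizeof(sense);
        io.sbp = sense;
        io.dxfer_direction = SG_DXFER_NONE;
        io.timeout = 30000;                     /* milliseconds */

        ok = ioctl(fd, SG_IO, &io) == 0 &&
             io.status == 0 && io.host_status == 0 && io.driver_status == 0;
        close(fd);

        /* A daemon would now reenable or keep the path offline via the
         * (yet to be defined) driverfs interface. */
        printf("%s: path %s\n", argv[1], ok ? "ready" : "NOT ready");
        return ok ? 0 : 1;
}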
What are your ideas on the potential timeframe?
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
On Wed, Sep 11, 2002 at 09:20:38AM -0500, James Bottomley wrote:
> [email protected] said:
> > The scsi multi-path code is not in 2.5.x, and I doubt it will be
> > accepted without the support of James and others.
>
> I haven't said "no" yet (and Doug and Jens haven't said anything).
Well, I for one was gone on vacation, and I'm allowed to ignore linux-scsi
in such times, so, as Bill de Cat would say, thptptptppt! :-)
> I did say
> when the patches first surfaced that I didn't like the idea of replacing
> Scsi_Device with Scsi_Path at the bottom and the concomitant changes to all
> the Low Level Drivers which want to support multi-pathing. If this is to go
> in the SCSI subsystem it has to be self contained, transparent and easily
> isolated. That means the LLDs shouldn't have to be multipath aware.
I agree with this.
> I think we all agree:
>
> 1) that multi-path in SCSI isn't the way to go in the long term because other
> devices may have a use for the infrastructure.
I'm not so sure about this. I think in the long run this is going to end
up blurring the line between SCSI layer and block layer IMHO.
> 2) that the scsi-error handler is the big problem
Aye, it is, and for more than just this issue.
> 3) that errors (both medium and transport) may need to be propagated
> immediately up the block layer in order for multi-path to be handled
> efficiently.
This is why I'm not sure I agree with 1. If we are doing this, then we
are sending up SCSI errors at which point the block layer now needs to
know SCSI specifics in order to properly decide what to do with the error.
That, or we are building specific "is this error multipath relevant" logic
into the SCSI layer and then passing the result up to the block layer.
I'm the kind of person whose preference would be that either A) the SCSI
layer doesn't know jack about multipath and the block layer handles it all,
or B) the block layer doesn't know about our multipath and the SCSI layer
handles it all. I don't like the idea of mixing them at this current
point in time (there really isn't much of a reason to mix them yet, and
people can only speculate that there might be reason to do so later).
> Although I outlined my ideas for a rework of the error handler, they got lost
> in the noise of the abort vs reset debate. These are some of the salient
> features that will help in this case
[ snipped eh features ]
I'll have to respond to these items separately. They cross over some with
these issues, but really they aren't tied directly together and deserve
separate consideration.
--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606
Doug Ledford [[email protected]] wrote:
> On Wed, Sep 11, 2002 at 09:20:38AM -0500, James Bottomley wrote:
> > [email protected] said:
> > I did say
> > when the patches first surfaced that I didn't like the idea of replacing
> > Scsi_Device with Scsi_Path at the bottom and the concomitant changes to all
> > the Low Level Drivers which want to support multi-pathing. If this is to go
> > in the SCSI subsystem it has to be self contained, transparent and easily
> > isolated. That means the LLDs shouldn't have to be multipath aware.
>
> I agree with this.
>
In the mid-level mp patch the adapters are not aware of multi-path. The
changes to adapters carried in the patch have to do with a driver not
allowing aborts during link down cases or iterating over the host_queue
for ioctl or /proc reasons. The hiding of some of the lists behind APIs is
something I had to do in the host list cleanup. We might even do some of
this same list cleanup outside of mp.
Also add my "me too" to the "scsi error handling is lacking" statement. I
am currently trying to do something about not using the failed command
for error recovery (post abort).
-Mike
--
Michael Anderson
[email protected]
On Wed, Sep 11, 2002 at 02:37:40PM -0500, James Bottomley wrote:
> [email protected] said:
> > and if the device mapper could have the notion of "to what topology
> > group does this device belong to", or even "distance metric (without
> > going into further detail on what this is, as long as it is consistent
> > to the physical layer) to the current CPU" (so that the shortest path
> > in NUMA could be selected), that would be kinda cool ;-) And doesn't
> > seem too intrusive.
>
> I think I see driverfs as the solution here. Topology is deduced by examining
> certain device and HBA parameters. As long as these parameters can be exposed
> as nodes in the device directory for driverfs, a user level daemon can map the
> topology and connect the paths at the top. It should even be possible to
> weight the constructed multi-paths.
>
> This solution appeals because the kernel doesn't have to dictate policy, all
> it needs to be told is what information it should be exposing and lets user
> level get on with policy determination (this is a mini version of why we
> shouldn't have network routing policy deduced and implemented by the kernel).
Not coincidentally, network routing policy _is_ multipath config in
the iSCSI or nbd case.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
In article <[email protected]> you wrote:
>> I've been thinking about this separately. FC in particular needs some type of
>> event notification API (something like "I've just seen this disk" or "my
>> loop just went down"). I'd like to leverage a mid-layer api into hot plug
>> for some of this, but I don't have the details worked out.
> This isn't just FC, but also dasd on S/390. Potentially also network block
> devices, which can notice a link down.
It is true for every device: from an IDE disk which can fail, to a SCSI
device, to a network link; even hot-plug PCI cards, CPUs, RAM modules and so
on can use this API.
Recovering from a hardware failure is a major weakness of non-host
operating systems. Linux filesystems used to panic much too often. It is
getting better, but there is still a long way to go to allow multipath IO,
especially in environments where it is more complicated than a failed FC loop.
For example, two FC adapters on two different PCI buses should be able to
deactivate themselves if IO through one adapter locks up.
> I vote for exposing the path via driverfs (which, I think, is already
> concensus so the multipath group, topology etc can be used) and allowing
> user-space to reenable them after doing whatever probing deemed necessary.
There are situations where a path is reenabled by the kernel, for example
network interfaces. But it makes sense in an HA environment to move this to a
user mode daemon, simply because the additional stability is worth the extra
daemon. On the other hand, the kernel might need to detect congestion/jamming
anyway, especially if load balancing is used.
Personally I like the simple md approach for IO multipath, at least as long as
the whole kernel is not able to operate sanely with changing hardware topology.
It is especially less intrusive for all the drivers out there, and allows
funny combinations like SAN with NAS backups.
Greetings
Bernd