2007-10-13 19:09:55

by Rob Landley

Subject: What still uses the block layer?

My impression from asking questions on the linux-scsi mailing list is that the
scsi upper/middle/lower layers don't use the block layer described in
Documentation/block/*.

For example, the scsi guys say:
http://marc.info/?l=linux-scsi&m=118633268527856&w=2

Instead of using the block layer, SCSI reinvents this particular wheel itself.
There's a scsi "upper layer" that provides /dev nodes, scsi low-level
drivers, and a gigantic glue layer in between called the "scsi midlayer" that's
something like a networking stack, and is responsible for losing track of all
your devices so that the one SATA disk hardwired into your laptop might be
sda or sdc depending on whether or not you had a USB key plugged in when you
booted up. Anyway, the block layer isn't between any of these three, that I
can tell.

Now that IDE disks have been rerouted through the scsi layer, SATA goes
through the scsi layer, USB goes through the scsi layer, firewire goes
through the scsi layer... What's left? It seems like everything but
ramdisks have now been routed through the scsi layer. My laptop hasn't got a
single SCSI device but it also hasn't got any block devices that don't show
up as scsi.

So what's still using the block layer? How do the scsi layers and the block
layer relate? I'm confused! (This is normal for me, but still...)

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.


2007-10-13 22:05:56

by Matthew Wilcox

Subject: Re: What still uses the block layer?

On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> My impression from asking questions on the linux-scsi mailing list is that the
> scsi upper/middle/lower layers don't use the block layer described in
> Documentation/block/*.

Entirely incorrect.

> Instead of using the block layer, SCSI reinvents this particular wheel itself.
> There's a scsi "upper layer" that provides /dev nodes, scsi low-level
> drivers, and a gigantic glue layer in between called the "scsi midlayer" that's
> something like a networking stack, and is responsible for losing track of all
> your devices so that the one SATA disk hardwired into your laptop might be
> sda or sdc depending on whether or not you had a USB key plugged in when you
> booted up. Anyway, the block layer isn't between any of these three, that I
> can tell.

You really need to get the fuck over yourself.

> Now that IDE disks have been rerouted through the scsi layer, SATA goes
> through the scsi layer, USB goes through the scsi layer, firewire goes
> through the scsi layer... What's left? It seems like everything but
> ramdisks have now been routed through the scsi layer. My laptop hasn't got a
> single SCSI device but it also hasn't got any block devices that don't show
> up as scsi.

That's nice. Why not take a look in drivers/block? Floppy, CCISS,
CPQDA, UMEM, UBD, loop, NBD, SX8, UB, AoE, and many more.

> So what's still using the block layer? How do the scsi layers and the block
> layer relate? I'm confused! (This is normal for me, but still...)

sd and sr are block drivers. In fact, the whole SCSI subsystem ...

	depends on BLOCK

Just take a look at sd.c. The init code reads:

	for (i = 0; i < SD_MAJORS; i++)
		if (register_blkdev(sd_major(i), "sd") == 0)
			majors++;

Then look at struct scsi_cmnd. It has a pointer to the block request
that was passed down to it. struct scsi_device has a pointer to the
block request_queue that's associated with the device. Block is what
has elevators and io schedulers -- that work isn't duplicated by scsi.
There's work to push more of scsi's infrastructure up into the block
layer, so non-scsi block devices can take advantage of it.
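
As a rough illustration of the pointers described above, here is a user-space sketch. The field names `request` and `request_queue` follow the description in this message; everything else (the reduced field lists, the helper `scsi_cmnd_from_request`) is a simplified assumption and not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: the real kernel structures carry many more fields. */
struct request_queue {
	const char *elevator_name;	/* I/O scheduler attached to this queue */
};

struct request {
	unsigned long sector;		/* first sector of the transfer */
	unsigned int nr_sectors;	/* length of the transfer */
	struct request_queue *q;	/* queue the request was taken from */
};

struct scsi_device {
	struct request_queue *request_queue;	/* block queue for this device */
};

struct scsi_cmnd {
	struct scsi_device *device;
	struct request *request;	/* block request this command serves */
};

/* Hypothetical helper: wrap a block request in a SCSI command. */
static struct scsi_cmnd scsi_cmnd_from_request(struct scsi_device *sdev,
					       struct request *rq)
{
	struct scsi_cmnd cmd = { .device = sdev, .request = rq };
	return cmd;
}
```

The point of the sketch is only that the SCSI command and the SCSI device both refer back to block-layer objects rather than replacing them.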

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-10-14 05:54:45

by David Newall

Subject: Re: What still uses the block layer?

Matthew Wilcox wrote:
> You really need to get the fuck over yourself.

That is so rude. You need to learn some manners.

2007-10-14 17:47:16

by Stefan Richter

Subject: Re: What still uses the block layer?

David Newall wrote:
> That is so rude.

Such responses sometimes happen after provocative posts like the thread
starter's. He could have asked straight away for help with fixing his
boot environment instead of wrapping his question into a feigned design
discussion. It appeared as if he is out for a fight rather than
interested in help.
--
Stefan Richter
-=====-=-=== =-=- -===-
http://arcgraph.de/sr/

2007-10-14 22:24:45

by James Bottomley

Subject: Re: What still uses the block layer?

On Sat, 2007-10-13 at 16:05 -0600, Matthew Wilcox wrote:
> On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> > My impression from asking questions on the linux-scsi mailing list is that the
> > > scsi upper/middle/lower layers don't use the block layer described in
> > Documentation/block/*.
>
> Entirely incorrect.

OK, right ... could we please get a sense of decorum back on this list.

Rob, if you didn't ask your alleged questions in such a pejorative
manner, we'd get a lot further; and Matthew, if you didn't rise to the
bait so spectacularly it wouldn't prolong these threads.

Really, both of you, I have better things to do with my time than
mediate behaviours that should have been educated out of you in the
kindergarten sand pit.

James


2007-10-14 22:35:59

by Tilman Schmidt

Subject: Re: What still uses the block layer?

On 14.10.2007 19:46, Stefan Richter wrote:
> David Newall wrote:
>> That is so rude.
>
> Such responses sometimes happen after provocative posts like the thread
> starter's.

Provocation is often in the eye of the beholder, and basic manners
should be observed nevertheless.

> He could have asked straight away for help with fixing his
> boot environment instead of wrapping his question into a feigned design
> discussion.

No, he couldn't have. He quite obviously didn't even know enough
to understand his boot environment might be at fault, and hence
was unable to conceive the question you're demanding from him.

> It appeared as if he is out for a fight rather than
> interested in help.

It may have appeared like that from the highly antagonistic mindset
that seems so prevalent in LKML. But if one just stepped back and
took a breath before answering it should have been quite obvious
that he wasn't. (out for a fight, that is)

Granted, it can be difficult to comprehend the point of view of
someone who does not know or understand something you yourself know
or understand well. But you should at least be aware of that
inability, and consequently refrain from accusing of provocation
where there may be none. Hanlon's razor, cynical as it may sound at
first, is an eminently humanistic principle.

--
Tilman Schmidt E-Mail: [email protected]
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse)



2007-10-14 23:37:24

by Rob Landley

Subject: Re: What still uses the block layer?

On Sunday 14 October 2007 12:46:12 pm Stefan Richter wrote:
> David Newall wrote:
> > That is so rude.

When a reply contains as a reply to the first paragraph "you're wrong" with no
elaboration, and as a reply to the second paragraph nothing but expletives
and personal insults, I tend to stop reading. It really doesn't come across
as a serious reply.

I was at least attempting to ask a serious question.

> Such responses sometimes happen after provocative posts like the thread
> starter's. He could have asked straight away for help with fixing his
> boot environment instead of wrapping his question into a feigned design
> discussion. It appeared as if he is out for a fight rather than
> interested in help.

Actually, I was going through Documentation/block thinking about making a
00-INDEX for it, but my earlier questions of the scsi guys left me with the
impression that the block layer is _not_ used by the SCSI layer. And since
every non-embedded modern storage device I'm aware of has been consumed by
the SCSI layer (despite none of them actually having a discernably closer
relationship to SCSI than ATA did), I didn't know whether or not it was more
appropriate to index this directory or request its deletion. So I asked.

Back when I asked the scsi guys about this, I got no direct answer. I
asked "where does the block layer work into this" in the context of questions
about the relationship between the scsi upper, middle, and lower layers, and
I never got a reply, even though the question was quoted back at me here:
http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg09086.html

The closest I got to an answer was later in the thread:
http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg09131.html

Which said:
> That approach makes the Linux block layer either a nuisance,
> irrelevant or a complete anachronism (in the case of OSD).
> IMO the linux block layer should be morphed into a library
> of internal queue handling routines. Storage upper level
> drivers such as sd can continue to present the "block"
> view ** of storage devices such as disks.

The gist of the thread (and the documentation I was referred to) is that the
scsi "upper layer" presents /dev nodes and ioctls, the scsi mid-layer is a
routing layer very roughly analogous to a TCP/IP stack, and the scsi low-layer
drivers interface with specific pieces of hardware. Apparently, the block
layer is not between any of these, they talk directly to each other. This
would seem to indicate that I/O requests made to scsi devices are never
routed through a common block I/O request handling layer shared with non-SCSI
block devices. I was not, however, certain of this, hence my attempt to
bring the topic back up.

Oh, and sending a patch correcting Jens Axboe's address in this old
documentation. He's apparently at Oracle now...

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-14 23:46:07

by Rob Landley

Subject: Re: What still uses the block layer?

On Sunday 14 October 2007 5:24:32 pm James Bottomley wrote:
> On Sat, 2007-10-13 at 16:05 -0600, Matthew Wilcox wrote:
> > On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> > > My impression from asking questions on the linux-scsi mailing list is
> > > > that the scsi upper/middle/lower layers don't use the block layer
> > > described in Documentation/block/*.
> >
> > Entirely incorrect.
>
> OK, right ... could we please get a sense of decorum back on this list.

Did I reply to the insult?

> Rob, if you didn't ask your alleged questions in such a pejorative
> manner, we'd get a lot further

I'm not attempting to be pejorative.

I admit a certain amount of personal annoyance that once the SCSI layer
consumes a category of device (USB, SATA, PATA), they can often _only_ be
used by going through the SCSI midlayer. (This strikes me as analogous to
TCP/IP claiming ethernet and PPP devices so thoroughly that you can no longer
address them as eth1 or /dev/ttyS0.)

This has the annoying effect of bundling together different types of devices
and making device enumeration unnecessarily difficult: my laptop only has one
SATA hard drive and can't gain another without a soldering iron, but that
drive could move from /dev/sda to /dev/sdb if I reboot the system with a USB
key plugged in. This seems like a regrettable loss of orthogonality to me.
I remember back when /dev/usb0 and /dev/hda were separate devices that showed
up in /dev, but these days "it's SCSI" seems to trump "it's USB", "it's ATA",
or "it's SATA". (Even though none of those are actually SCSI hardware, they
just send a similar packet protocol across the wire.)
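
The renaming effect described here can be shown with a toy model. This is only a sketch under the assumption that `sdX` letters are handed out strictly in discovery order; the function `toy_sd_letter` and the bus labels are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model: disks get letters a, b, c, ... in the order they are
 * discovered, regardless of which bus they sit on. Returns the letter
 * given to the first disk on `bus`, or '\0' if no such disk was probed.
 */
static char toy_sd_letter(const char *const probe_order[], size_t n,
			  const char *bus)
{
	for (size_t i = 0; i < n && i < 26; i++)
		if (strcmp(probe_order[i], bus) == 0)
			return (char)('a' + i);
	return '\0';
}
```

With a probe order of just `{"sata"}` the internal disk is `sda`; with `{"usb", "sata"}` the same disk becomes `sdb`, which is the shift complained about above.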

The fact that udev can theoretically unwind this hairball is not an excuse for
conflating different categories of devices in the first place. Avoiding an
unnecessary problem seems superior to trying to get udev to solve it. Note
that Ubuntu 7.04 solves it by sticking a UUID on every _partition_, and then
spinning up my external USB hard drive trying to find the root partition on a
reboot. Tell me how this can be considered progress:

> # /etc/fstab: static file system information.
> #
> # <file system> <mount point> <type> <options> <dump> <pass>
> proc /proc proc defaults 0 0
> # /dev/sda1
> UUID=04d1b984-bd65-46f1-9a77-c158cf4bed1b / ext3 defaults,errors=remount-ro,noatime 0 1
> # /dev/sda5
> UUID=cdf0936d-9f19-42c6-b131-9fefcf1321ef none swap sw 0 0
> /dev/scd0 /media/cdrom0 udf,iso9660 user,noauto 0 0
> UUID=86bbb512-ab7e-4a12-8618-1190f032c082 /boot ext3 defaults 0 0

Conflating categories of hardware that cannot easily be enumerated (USB) with
categories that can (the SATA hard drive in my laptop, of which there can be
only one) strikes me as a bad thing. Putting them in a common "scsi device
pool" within which they do not enumerate consistently is not something I
enjoy dealing with.

However, the response to my attempts to express this dissatisfaction on the
SCSI list a few months ago came too close to a flamewar for me to consider
continuing it productive. I'd still love to update the "2.4 scsi howto" and
corresponding sg howto, but lack the expertise. The SCSI layer really isn't
my area, and I was much happier back when I could avoid using it at all.

The question I was trying to ask _here_ was about the block layer. I seem not
to have asked it very well. Sorry 'bout that.

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 00:45:58

by Luben Tuikov

Subject: Re: What still uses the block layer?

--- James Bottomley <[email protected]> wrote:
> On Sat, 2007-10-13 at 16:05 -0600, Matthew Wilcox wrote:
> > On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> > > My impression from asking questions on the linux-scsi mailing list is that the
> > > > scsi upper/middle/lower layers don't use the block layer described in
> > > Documentation/block/*.
> >
> > Entirely incorrect.
>
> OK, right ... could we please get a sense of decorum back on this list.
>
> Rob, if you didn't ask your alleged questions in such a pejorative
> manner, we'd get a lot further; and Matthew, if you didn't rise to the
> bait so spectacularly it wouldn't prolong these threads.
>
> Really, both of you, I have better things to do with my time than
> mediate behaviours that should have been educated out of you in the
> kindergarten sand pit.

I really didn't find Rob's email "pejorative" at all. It seems to me
he was just asking for clarification, information and trying to
understand how it all works and ties together. His email seemed
genuine enough of a person just asking to understand how it all works.

Matthew's expletive and extremely rude response really shows
the general attitude of the linux-scsi people.

Heck, I got a similar response just a week ago here on the
list, trying to convince Garzik and his band, that storage nodes
SHOULD NOT be SAS WWN generators. Should I have even tried? That's
the question.

Good luck everyone,
Luben

2007-10-15 01:24:14

by NeilBrown

Subject: Re: What still uses the block layer?

On Sunday October 14, [email protected] wrote:
> On Sunday 14 October 2007 12:46:12 pm Stefan Richter wrote:
> > David Newall wrote:
> > > That is so rude.
>
> When a reply contains as a reply to the first paragraph "you're wrong" with no
> elaboration, and as a reply to the second paragraph nothing but expletives
> and personal insults, I tend to stop reading. It really doesn't come across
> as a serious reply.
>
> I was at least attempting to ask a serious question.

Indeed you were, and let me try to answer it as best I can.

I like to think of the "block layer" as two main parts.

Firstly there is the "interface" which it defines, embodied primarily
in generic_make_request() and 'struct bio'. There are various other
small routines in ll_rw_blk.c, and there is 'struct request_queue'
which is also involved in the other half of the block layer.

This interface defines how requests are passed down, how their
completion is acknowledged, and various other little details.

Any block device can register a make_request_fn function and get the
requests (struct bio) almost exactly as the client (filesystem or
whatever) sent them down - just with a few sanity checks and some
translation (for partitions) applied.

The other half of the "block layer" is the io scheduler code.
This involves the 'struct request' and __make_request() and the various
routines it calls.
This collects bios (passed down from clients) and produces 'requests'
which devices can handle. One of the important differences between
bios and requests is the amount of parallelism.
A filesystem can send down as many concurrent bios as it likes (or as
it can allocate memory for).
A device can only handle a limited number of requests at a time,
depending on the limit of the 'tags command queueing' mechanism
particular to that device.
The scheduler bridges this parallelism gap by .... scheduling.
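
To make the bio/request distinction concrete, here is a toy user-space sketch of the kind of work a scheduler does: it sorts incoming bios by sector and merges contiguous ones into fewer requests. The types `toy_bio` and `toy_request` and the function `toy_schedule` are invented for this illustration; the real elevators are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_bio     { unsigned long sector; unsigned int nr_sectors; };
struct toy_request { unsigned long sector; unsigned int nr_sectors; };

/* qsort comparator: order bios by starting sector. */
static int cmp_bio(const void *a, const void *b)
{
	const struct toy_bio *x = a, *y = b;
	return (x->sector > y->sector) - (x->sector < y->sector);
}

/*
 * Sort the bios, then fold each bio that starts exactly where the
 * previous request ends into that request. `out` must have room for
 * n entries. Returns the number of requests produced.
 */
static size_t toy_schedule(struct toy_bio *bios, size_t n,
			   struct toy_request *out)
{
	size_t nreq = 0;

	qsort(bios, n, sizeof(*bios), cmp_bio);
	for (size_t i = 0; i < n; i++) {
		if (nreq > 0 && out[nreq - 1].sector +
				out[nreq - 1].nr_sectors == bios[i].sector) {
			out[nreq - 1].nr_sectors += bios[i].nr_sectors;
		} else {
			out[nreq].sector = bios[i].sector;
			out[nreq].nr_sectors = bios[i].nr_sectors;
			nreq++;
		}
	}
	return nreq;
}
```

Four bios at sectors 16, 0, 8 and 100 (8 sectors each) would come out as two requests: one covering sectors 0-23 and one covering 100-107.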

So the "block layer" consists of "block interface" and "io scheduler"

All block devices use the "block interface" - they have no choice.
Many block devices use the "io scheduler", but many don't.
md and dm, loop, umem, and others do their own scheduling as they have
needs that are specific to the devices, or that otherwise don't
benefit from the io scheduler (which is really designed for
rotating-media style devices).

SCSI devices can be both block devices and non-block devices
(traditionally 'char devices').

The 'scsi generic' or 'sg' interface to SCSI devices allows arbitrary
SCSI commands to be sent to a SCSI device. There are many SCSI
devices that are not block devices at all (media robots, etc).

When a SCSI device is being used as a block device, the block
interface is used. When it is being used as a 'generic device', the
block interface is not used.

Now we get to the heart of the matter, and to where my knowledge
becomes a little less detailed - so please forgive if I say something
silly.

I believe that the SCSI-generic handling still uses the IO scheduler,
even though it doesn't use the block interface.
It is probable that the IO scheduler is not a perfect match for the
needs of SCSI-generic handling. Given its origin, that should not be
surprising.

I believe the linux-scsi email that you referred to was addressing this
issue. When the author says:

That approach makes the Linux block layer either a nuisance,
irrelevant or a complete anachronism

I believe he is referring to what I would call the IO scheduler, and is
observing that it is not a perfect fit. He is probably right.

So to answer your question:

SCSI block devices use both the "block interface" and the "io
scheduler" and I believe that when people talk about "the block layer"
they refer to these two things.
i.e. the SCSI layer provides "scsi_request_fn". The block interface
calls __make_request which performs IO scheduling and calls
scsi_request_fn for each request.
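
The call path above can be modelled in user space with a couple of function pointers. This is a sketch only: the `toy_*` names are invented stand-ins for generic_make_request(), __make_request() and scsi_request_fn(), and the counters exist purely so the two paths can be told apart.

```c
#include <assert.h>
#include <stddef.h>

struct toy_queue;
struct toy_bio { unsigned long sector; };

typedef void (*toy_make_request_fn)(struct toy_queue *q, struct toy_bio *bio);
typedef void (*toy_request_fn)(struct toy_queue *q);

struct toy_queue {
	toy_make_request_fn make_request;	/* entry point for bios */
	toy_request_fn request;			/* driver hook for requests */
	int scheduled;	/* bios that went through the scheduler */
	int pending;	/* requests waiting for the driver */
	int dispatched;	/* requests the driver has taken */
};

/* Stand-in for __make_request(): schedule, then poke the driver. */
static void toy_default_make_request(struct toy_queue *q, struct toy_bio *bio)
{
	(void)bio;
	q->scheduled++;
	q->pending++;
	q->request(q);
}

/* Stand-in for scsi_request_fn(): drain whatever is pending. */
static void toy_scsi_request_fn(struct toy_queue *q)
{
	while (q->pending > 0) {
		q->pending--;
		q->dispatched++;
	}
}

/* Stand-in for a driver such as md that handles bios itself. */
static void toy_stacked_make_request(struct toy_queue *q, struct toy_bio *bio)
{
	(void)bio;
	q->dispatched++;
}

/* Stand-in for generic_make_request(): the one entry point for everybody. */
static void toy_generic_make_request(struct toy_queue *q, struct toy_bio *bio)
{
	q->make_request(q, bio);
}
```

Both kinds of device are entered through the same interface; only the queue that kept the default make_request function passes through the scheduler on the way to its request function.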

Hope that helps.

NeilBrown

2007-10-15 01:46:14

by Theodore Ts'o

Subject: Re: What still uses the block layer?

On Sun, Oct 14, 2007 at 06:45:44PM -0500, Rob Landley wrote:
> I admit a certain amount of personal annoyance that once the SCSI
> layer consumes a category of device (USB, SATA, PATA), they can
> often _only_ be used by going through the SCSI midlayer. (This
> strikes me as analogous to TCP/IP claiming ethernet and PPP devices
> so thoroughly that you can no longer address them as eth1 or
> /dev/ttyS0.)

That's because modern USB, ATAPI (what was once known as IDE), and SATA
are really *all* using the SCSI command protocols at the low level, just
as Ethernet and PPP interfaces really are fundamentally the same
thing. You can rail against it, but that's the mark of someone who
refuses to accept reality.

> This has the annoying effect of bundling together different types of
> devices and making device enumeration unnecessarily difficult: my
> laptop only has one SATA hard drive and can't gain another without a
> soldering iron, but that drive could move from /dev/sda to /dev/sdb
> if I reboot the system with a USB key plugged in. This seems like a
> regrettable loss of orthogonality to me. I remember back when
> /dev/usb0 and /dev/hda were separate devices that showed up in /dev,
> but these days "it's SCSI" seems to trump "it's USB", "it's ATA", or
> "it's SATA". (Even though none of those are actually SCSI hardware,
> they just send a similar packet protocol across the wire.)

You're showing your ignorance here. In fact in the past few years,
ATA and SCSI have been converging significantly, with the ATAPI
specification essentially incorporating the SCSI protocol by
reference and by value --- to the point that SAS was developed by
the SCSI Trade Association, and SAS is effectively a superset of SATA,
to the point where with care, you can actually mix SAS and SATA drives
in the same enclosure (SAS and SATA are physically compatible on
the connector level).

More to the point, with SATA, hot plugging has been designed in, so
probing order is not going to be well defined, just as with USB
devices. And there are already relatively common situations where the
same disk can show up via multiple different interfaces.

For example, if you have a modern Thinkpad with a secondary SATA hard
drive in an Ultrabay, and you plug it into the Ultrabay in your T60,
it will show up as a SATA drive. However, if you plug it into the
Advanced dock, it shows up as a USB device. And with iSCSI not only
can you encapsulate a SCSI command stream over USB, you can do so over
IP as well. In any case, regardless of how the physical SATA drive is
attached to the system, you want it to show up as the same device and
be mounted in the same location.

That's why identifying filesystems by UUIDs or labels is so critical.
This is not a new concept; we've had the capability to do this for
over a decade, and I always knew it would be necessary for us to do
this sooner or later --- which is why I added the UUID support to ext2
back in 1996.
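
For readers who have not looked at where that UUID lives, here is a small sketch that pulls it out of a raw buffer. It assumes the usual ext2/ext3 on-disk layout (superblock at byte 1024, little-endian magic 0xEF53 at offset 56, 16-byte UUID at offset 104); the function name `ext2_uuid_string` is invented for this example, and real tools should use libblkid or similar rather than parsing by hand.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EXT2_SB_OFFSET	1024	/* superblock starts 1024 bytes in */
#define EXT2_MAGIC_OFF	56	/* s_magic, little-endian 0xEF53 */
#define EXT2_UUID_OFF	104	/* s_uuid, 16 raw bytes */

/*
 * Format the filesystem UUID found in `dev`, a buffer holding the start
 * of a block device or image. `out` must hold 37 bytes. Returns 0 on
 * success, -1 if the buffer is too short or the ext2 magic is missing.
 */
static int ext2_uuid_string(const unsigned char *dev, size_t len,
			    char out[37])
{
	const unsigned char *sb, *u;

	if (len < EXT2_SB_OFFSET + EXT2_UUID_OFF + 16)
		return -1;
	sb = dev + EXT2_SB_OFFSET;
	if (sb[EXT2_MAGIC_OFF] != 0x53 || sb[EXT2_MAGIC_OFF + 1] != 0xEF)
		return -1;
	u = sb + EXT2_UUID_OFF;
	snprintf(out, 37,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x-%02x%02x%02x%02x%02x%02x",
		 u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
		 u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
	return 0;
}
```

Because the identifier is stored inside the filesystem itself, it stays the same whether the disk is reached over SATA, USB or anything else, which is what makes it usable as a stable name in /etc/fstab.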

> The fact that udev can theoretically unwind this hairball is not an
> excuse for conflating different categories of devices in the first
> place.

See the thinkpad Ultrabay drive example above. You address hosts by
IP address; it doesn't matter whether you access them via a PPP
interface, or a wireless interface, or an ethernet interface.
Similarly, a disk could in theory be accessible over USB, SATA, or
iSCSI, and the Thinkpad example is only one such where the same
filesystem might be accessible over multiple interfaces. And with
multipath Fibre Channel SANs (and I hate to break it to you, but FC
also uses SCSI protocols) storage is very much looking more and more
like networking.

- Ted

2007-10-15 05:45:20

by Stefan Richter

Subject: Re: What still uses the block layer?

Rob Landley wrote:
> I was at least attempting to ask a serious question.
...
> Actually, I was going through Documentation/block thinking about making a
> 00-INDEX for it, but my earlier questions of the scsi guys left me with the
> impression that the block layer is _not_ used by the SCSI layer.

Ah, so it was about your documentation work. I already forgot the
context of your previous inquiries. Alas the tone of them already did
some damage, leading to responses like these.

...
> since
> every non-embedded modern storage device I'm aware of has been consumed by
> the SCSI layer (despite none of them actually having a discernably closer
> relationship to SCSI than ATA did)
...

The Linux SCSI subsystems don't consume, they provide services; nowadays
not only for SCSI hardware and SCSI protocols but also for a number of
subsystems whose tasks are similar enough to SCSI subsystems to make the
SCSI core and upper SCSI layer useful to them too.

BTW:
| Now that IDE disks have been rerouted through the scsi layer, SATA goes
| through the scsi layer, USB goes through the scsi layer, firewire goes
| through the scsi layer...

As a side note, SBP-2 is a SCSI transport protocol, hence ieee1394/sbp2
and firewire/fw-sbp2 are Linux SCSI low-level drivers. Anything else
would be just wrong and infeasible in this particular case.
--
Stefan Richter
-=====-=-=== =-=- -====
http://arcgraph.de/sr/

2007-10-15 06:06:19

by Greg KH

Subject: Re: What still uses the block layer?

On Sun, Oct 14, 2007 at 06:45:44PM -0500, Rob Landley wrote:
> On Sunday 14 October 2007 5:24:32 pm James Bottomley wrote:
> > On Sat, 2007-10-13 at 16:05 -0600, Matthew Wilcox wrote:
> > > On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> > > > My impression from asking questions on the linux-scsi mailing list is
> > > > > that the scsi upper/middle/lower layers don't use the block layer
> > > > described in Documentation/block/*.
> > >
> > > Entirely incorrect.
> >
> > OK, right ... could we please get a sense of decorum back on this list.
>
> Did I reply to the insult?
>
> > Rob, if you didn't ask your alleged questions in such a pejorative
> > manner, we'd get a lot further
>
> I'm not attempting to be pejorative.
>
> I admit a certain amount of personal annoyance that once the SCSI layer
> consumes a category of device (USB, SATA, PATA), they can often _only_ be
> used by going through the SCSI midlayer. (This strikes me as analogous to
> TCP/IP claiming ethernet and PPP devices so thoroughly that you can no longer
> address them as eth1 or /dev/ttyS0.)

If you hate USB storage devices using scsi, please use the ub driver,
that is what it was written for.

> This has the annoying effect of bundling together different types of devices
> and making device enumeration unnecessarily difficult: my laptop only has one
> SATA hard drive and can't gain another without a soldering iron, but that
> drive could move from /dev/sda to /dev/sdb if I reboot the system with a USB
> key plugged in. This seems like a regrettable loss of orthogonality to me.
> I remember back when /dev/usb0 and /dev/hda were separate devices that showed
> up in /dev, but these days "it's SCSI" seems to trump "it's USB", "it's ATA",
> or "it's SATA". (Even though none of those are actually SCSI hardware, they
> just send a similar packet protocol across the wire.)

When did usb-storage devices ever show up as /dev/usb0? USB flash disks
are really SCSI devices, look at the USB storage spec for proof of that.

thanks,

greg k-h

2007-10-15 06:51:48

by Rob Landley

Subject: Re: What still uses the block layer?

On Sunday 14 October 2007 7:45:46 pm Luben Tuikov wrote:
> Matthew's expletive and extremely rude response really shows
> the general attitude of the linux-scsi people.

No, it doesn't. James Bottomley has been exceedingly polite and helpful, as
were several other people on the linux-scsi list when I asked them about this
stuff back in August.

Religion, politics, and anything remotely related to hotplug appear to be
topics to avoid in polite company if you want it to remain polite. (My
gripes with scsi mostly have to do with device enumeration. My attempts to
use sysfs also have to do with device enumeration. I've spotted a trend
here.)

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 08:04:22

by Rob Landley

Subject: Re: What still uses the block layer?

On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:
> On Sun, Oct 14, 2007 at 06:45:44PM -0500, Rob Landley wrote:
> > I admit a certain amount of personal annoyance that once the SCSI
> > layer consumes a category of device (USB, SATA, PATA), they can
> > often _only_ be used by going through the SCSI midlayer. (This
> > strikes me as analogous to TCP/IP claiming ethernet and PPP devices
> > so thoroughly that you can no longer address them as eth1 or
> > /dev/ttyS0.)
>
> That's because modern USB, ATAPI (what was once known as IDE), and SATA
> are really *all* using the SCSI command protocols at the low level,

Ok, I'll bite. If it's all "real" scsi, why does ioctl(SG_EMULATED_HOST)
exist?

> just
> as Ethernet and PPP interfaces really are fundamentally the same
> thing.

They're the same thing?

Do you mean that on a system with both, going:
ifconfig eth1 66.92.53.140
ifconfig ppp 192.168.0.42

Would be functionally equivalent to:
ifconfig eth1 192.168.0.42
ifconfig ppp 66.92.53.140

So if on one boot the addresses are assigned the first way, and upon reboot
they're assigned in the second way by exactly the same set of commands... well
that's not IMPORTANT, is it? (Or is it that everyone everywhere should use
dhcp for everything, and static addressing is obsolete and no longer
supported? Apparently dhcp addresses should be delivered by machines with
only one network interface of any type...)

This is my objection. Even when enumerating multiple devices of the same type
is tricky, enumerating multiple devices of _different_ types should not be.
There's a great big type indicator that is being _deliberately_ ignored, and
large classes of devices (millions of laptops) where you know there's only
going to be _one_ instance of a given type.

By the way, ethernet cards contain a unique MAC address. Hard drives do not
seem to, or if they do it's not being consistently exposed in a way I can
find. This is sad. (No, reading data from the device to determine this gets
us back to the "spinning up the external USB drive to find my root partition"
gripe mentioned earlier.)

> You can rail against it, but that's the mark of someone who
> refuses to accept reality.

Let me clarify: I'm talking about device enumeration.

I've never had trouble enumerating a device that was _not_ routed through the
scsi layer, largely because the systems I work with don't usually have more
than one device of the same type. (There are millions of laptop and desktop
devices out there where this is the common case. As I said, I may have four
USB ports and the ability to plug hubs into them, but you can't add another
SATA hard drive to my laptop without a soldering iron.)

However, as soon as a device _is_ routed through the scsi layer (as PATA was a
few versions back), it gets conflated with numerous other devices. This
creates problems. SATA isn't hard to enumerate in my laptop, USB potentially
is. Dumping all the SATA devices into the same bucket with the USB devices
makes both harder to enumerate.

> > This has the annoying effect of bundling together different types of
> > devices and making device enumeration unnecessarily difficult: my
> > laptop only has one SATA hard drive and can't gain another without a
> > soldering iron, but that drive could move from /dev/sda to /dev/sdb
> > if I reboot the system with a USB key plugged in. This seems like a
> > regrettable loss of orthogonality to me. I remember back when
> > /dev/usb0 and /dev/hda were separate devices that showed up in /dev,
> > but these days "it's SCSI" seems to trump "it's USB", "it's ATA", or
> > "it's SATA". (Even though none of those are actually SCSI hardware,
> > they just send a similar packet protocol across the wire.)
>
> You're showing your ignorance here.

I have buckets of ignorance. It's why I ask questions.

> In fact in the past few years,
> ATA and SCSI have been converging significantly,

And down far enough all these devices are powered by electricity. Are we
going to wind up with /dev/electric[1-999]?

SATA != PATA != USB. But /dev/sda can be PATA, /dev/sdb SATA, and /dev/sdc
USB. And they can move relative to each other. This didn't use to be the
case. Why is it considered an improvement?

> with the ATAPI
> specification has essentially incorporating the SCSI protocol by
> reference and by value --- with the point that SAS was developed by
> the SCSI Trade Association, and SAS is effectively a superset of SATA,
> to the point where with care, you can actually mix SAS and SATA drives
> on the same in enclosure (SAS and SATA are physically compatible on
> the connector level).

I'm aware of this, and under the impression they're both modified gigabit
ethernet at the PHY level. Should the hard drive become eth2?

> More to the point, with SATA, hot plugging has been designed in, so
> probing order is not going to be well defined,

The spec may define the capability to hotplug, but your average laptop doesn't
offer the capability to hotplug anything into its SATA controllers. The
hard drive is screwed in (due to the portability part of laptopness), all the
controllers wired onto the motherboard are accounted for, none are exposed
externally. What _is_ exposed externally is USB, and if you want to add an
extra hard drive you can buy a cheap USB one at Fry's.

In such a case, which is common, the first SATA hard drive is reliably the
disk containing the root partition, and there's no need to stick a UUID
in /etc/fstab.

The problem is, "the first SATA hard drive" is not a stable identifier in a
system where SATA and USB devices are dumped in the same bucket and given a
big stir. Dumping SATA and USB devices into the same bucket (because they
smell a bit like SCSI) is what I am objecting to.

> just as with USB
> devices. And there are already relatively common situations where the
> same disk can show up via multiple different interfaces.

It was also possible to buy a hotplug PATA ide enclosure. So what? The vast
majority of traditional IDE users happily ignored this, and went on with
their lives.

> For example, if you have a modern Thinkpad with an secondary SATA hard
> drive in an Ultrabay, and you plug it into the Ultrabay in your T60,
> it will show up as a SATA drive.

I remember the config option about enumerating onboard IDE controllers first.
It didn't really matter what order they were enumerated in as long as it was
controllable.

Presumably if the primary SATA hard drive was /dev/sata and the slot
with "secondary" in its name got /dev/satb, life would be good. And the
presence or absence of /dev/satb wouldn't affect USB devices and such if they
weren't in the same namespace.

> However, if you plug it into the
> Advanced dock, it shows up as a USB device.

You plug it in somewhere else, it shows up somewhere else. This sounds
familiar to old IDE users. :)

How is it harder for udev to make a stable symlink for this drive that
sometimes points to /dev/satb and sometimes to /dev/usb1? (Harder than a
symlink that sometimes points to /dev/sdb and sometimes to /dev/sdd? You
don't have persistent naming _now_, so the objection seems to be that
maintaining the distinction between device types would not be a perfect
solution in all cases. I agree. So?)

> And with iSCSI not only
> can you encapsulate a SCSI command stream over USB, you can do so over
> IP as well.

Yup. And you've been able to make a network block device for years. They
showed up as /dev/nd0, a distinct type of block device which you (and your
scripts) could find. Now yet another way of doing the same thing is mixed
into the same scsi bucket and given a stir...

> In any case, regardless of how the physical SATA drive is
> attached to the system, you want it to show up as the same device and
> be mounted in the same location.

If my laptop's hard drive reliably showed up as /dev/sda every time, and I
could count on that, I wouldn't be complaining about it. The entire problem
is that it isn't guaranteed to do that, and thus /etc/fstab is a nightmare I
can't edit.

You could meet this definition of "the same" by having every block device in
the system show up as /dev/block[a-z] no matter what type it was, and all the
char devices show up as /dev/char[aa-zz], shuffle them all each reboot, and
then have all the programs iterate through all of them any time they wanted
something specific.

I'm rather glad that /dev/ttyS0 and /dev/zero aren't easy to mix up.

> That's why identifying filesystem by UUID's or Labels is so critical.
> This is not a new concept; we've had the capability to do this for
> over a decade, and I always knew it would be necessary for us to do
> this sooner or later --- which is why I added the UUID support to ext2
> back in 1996.

It's necessary for IBM big iron to do this. It's generally not necessary for
laptops or embedded systems to do this if they distinguish between _types_ of
devices, which is something they until recently did for the types of devices
I was interested in, and something they _stopped_ doing when everything got
merged into the scsi layer, and I consider this a regression.

No, distinguishing between types of devices is not a perfect solution to
device enumeration, but it was sufficient for all my use cases for many
years, and would still be if the kernel still did it, and I'm not alone here.

> > The fact that udev can theoretically unwind this hairball is not an
> > excuse for conflating different categories of devices in the first
> > place.
>
> See the thinkpad Ultrabay drive example above.

Last week I drove my laptop so deep into swap (with a "make -j" on qemu) that
after half an hour trying to repaint my kmail window, it locked solid.
Again. You'd think the oom killer would come to the rescue, but it didn't.
Maybe Ubuntu disabled it. I have _2_gigs_ of ram in this sucker, on a stock
Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times), and yet
I managed to make it swap itself to death one more time.

Virtual memory isn't perfect. I've _always_ been able to come up with
examples where it just doesn't work for me. This doesn't mean VM overcommit
should be abolished, because it's useful more often than not.

So you have a counterexample. Ok. I can't actually see how your
counterexample would be worse off than it is now; just differently worse off.

> You address hosts by
> IP address; it doesn't matter whether you access them via a PPP
> interface, or a wireless interface, or a ethernet interface.

It does when I'm configuring the interfaces.

> Similarly, a disk could in theory be accessible over USB, SATA, or
> iSCSI, and the Thinkpad example is only one such where the same
> filesystem might be accessible over multiple interfaces. And with
> multipath fiber channel SAN's (and I hate to break it to you, but FC
> also uses SCSI protocols) storage is very much looking more and more
> like networking.

And in the networking world I'm able to say that this local machine has a
static IP that is not world-routable. It is separate, manually configured, I
put it _right_here_, and I personally know that it's not going to move
because I'm the one who put it there and I'm the only one who would move it.

Over on the networking side of things I can "ifconfig lo 127.0.0.1" without
first probing all the interfaces to figure out which one's loopback and which
one's the wireless card.

I note that the eth0 and eth1 names are dynamically assigned on a first come
first serve basis (like scsi). This never causes me a problem because the
driver loading order is constant, and once you figure out that eth0 is
gigabit and eth1 is the 80211g it _stays_ that way across reboots, reliably.
Yeah, it's a heuristic. Hands up everybody relying on such a heuristic in
the real world.

Possibly one solution here is to document that the SATA drivers load before
any other scsi device (and that the driver subsystem _waits_ for that to finish
enumerating before trying any other kind of scsi device, with a barrier of
some kind), and then any SATA devices present at boot time will reliably get
those names in that order (no races, no variation) and anything after that is
a separate problem. (Of course this would involve making it true if it
currently isn't. It's still a mess to dump all sorts of different devices in
the same namespace, but at least for the common case of a laptop with a SATA
root partition this would let us get the UUID out of /etc/fstab).

> - Ted

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 08:27:45

by Nick Piggin

Subject: OOM killer gripe (was Re: What still uses the block layer?)

On Monday 15 October 2007 18:04, Rob Landley wrote:
> On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:

> > > excuse for conflating different categories of devices in the first
> > > place.
> >
> > See the thinkpad Ultrabay drive example above.
>
> Last week I drove my laptop so deep into swap (with a "make -j" on qemu)
> that after half an hour trying to repaint my kmail window, it locked solid.
> Again. You'd think the oom killer would come to the rescue, but it didn't.
> Maybe Ubuntu disabled it. I have _2_gigs_ of ram in this sucker, on a
> stock Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times),
> and yet I managed to make it swap itself to death one more time.
>
> Virtual memory isn't perfect. I've _always_ been able to come up with
> examples where it just doesn't work for me. This doesn't mean VM
> overcommit should be abolished, because it's useful more often than not.

I hate to go completely offtopic here, but disks are so incredibly
slow when compared to RAM that there is really nothing the kernel
can do about this. Presumably the job will finish, given infinite
time.

How much swap do you have configured? You really shouldn't configure
so much unless you do want the kernel to actually use it all, right?
Because if we're not really conservative about OOM killing, then the
user who actually really did want to use all the swap they configured
gets angry when we kill their jobs without using it all.

Would an oom-kill-someone-now sysrq be of help, I wonder?

2007-10-15 08:37:44

by Luben Tuikov

Subject: Re: What still uses the block layer?

--- Rob Landley wrote:
> On Sunday 14 October 2007 7:45:46 pm Luben Tuikov wrote:
> > Matthew's expletive and extremely rude response really shows
> > the general attitude of the linux-scsi people.
>
> No, it doesn't. James Bottomley has been exceedingly polite and helpful, as
> were several other people on the linux-scsi list when I asked them about this
> stuff back in August.

I wasn't referring to him specifically. He also stepped into the WWN
thread in the same manner as he did in your thread.

> Religion, politics, and anything remotely related to hotplug appear to be
> topics to avoid in polite company if you want it to remain polite. (My
> gripes with scsi mostly have to do with device enumeration. My attempts to
> use sysfs also have to do with device enumeration. I've spotted a trend
> here.)
>
> Rob
> --
> "One of my most productive days was throwing away 1000 lines of code."
> - Ken Thompson.
>

2007-10-15 08:52:33

by Christoph Hellwig

Subject: Re: What still uses the block layer?

On Sun, Oct 14, 2007 at 11:00:15PM -0700, Greg KH wrote:
> If you hate USB storage devices using scsi, please use the ub driver,
> that is what it was written for.

The ub driver is a really dumb piece of shit. It only drives usb storage
devices using a scsi protocol set, and duplicates the scsi stack in a very
suboptimal way.

2007-10-15 09:06:39

by Julian Calaby

Subject: Re: What still uses the block layer?

On 10/15/07, Rob Landley wrote:
> I note that the eth0 and eth1 names are dynamically assigned on a first come
> first serve basis (like scsi). This never causes me a problem because the
> driver loading order is constant, and once you figure out that eth0 is
> gigabit and eth1 is the 80211g it _stays_ that way across reboots, reliably.
> Yeah, it's a heuristic. Hands up everybody relying on such a heuristic in
> the real world.

Umm, not quite, from my experiences with pre-production wireless
drivers, (another story, another time) fancy stuff is being done in
udev to make sure that your gigabit card is always assigned to eth0.

--

Julian Calaby

2007-10-15 09:26:25

by Rob Landley

Subject: Re: What still uses the block layer?

On Monday 15 October 2007 1:00:15 am Greg KH wrote:
> If you hate USB storage devices using scsi, please use the ub driver,
> that is what it was written for.

For the embedded space, the ability to configure out the scsi layer is
interesting from a size perspective. I bookmarked that a while back, but had
forgotten about it. Thanks for the reminder.

For the desktop I don't object to the scsi layer. I object to the naming.
Merging a half-dozen different types of devices into a single name space, and
then warning us that the order they appear within that namespace could be the
result of race conditions... Seems like an artificially inflated problem to
me. Don't merge them together and each namespace is a smaller problem, often
with only a single device or with a stable relationship between the devices.

(That said, the answer to my original question, "is the block layer still in
use" seems to be yes, so creating a 00-INDEX for Documentation/block is a
good thing, and I'll go do that. I acknowledge that I asked this question
_horribly_, due to having other unresolved issues with the scsi layer...)

> When did usb-storage devices ever show up as /dev/usb0? USB flash disks
> are really SCSI devices, look at the USB storage spec for proof of that.

Um, possibly I _was_ playing with the ub driver and got a /dev/ub0. (I
vaguely recall playing with it back around... February? When did it wander
across Pavel's blog... I don't actually remember if I got it to work or
not.) Possibly this is from playing with a usb scanner back around 2004. (I
just dragged out my other USB device from that period, an ethernet dongle,
but it doesn't create /dev anything. Just shows up as usb2. :)

The point I was trying to make is that it seems to me like it would be
possible to keep the namespace separate here, and thus reduce the enumeration
problems to the point where common cases (like my laptop) aren't impacted by
them during early boot. I don't think anybody (outside the embedded space)
is actually upset that /dev/hda now goes through the scsi layer: they're
upset Ubuntu 7.04 no longer calls it /dev/hda.

> thanks,
>
> greg k-h

Thank you,

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 09:26:42

by Rob Landley

Subject: Re: What still uses the block layer?

On Monday 15 October 2007 12:44:19 am Stefan Richter wrote:
> Rob Landley wrote:
> > I was at least attempting to ask a serious question.
>
> ...
>
> > Actually, I was going through Documentation/block thinking about making a
> > 00-INDEX for it, but my earlier questions of the scsi guys left me with
> > the impression that the block layer is _not_ used by the SCSI layer.
>
> Ah, so it was about your documentation work.

Well, triggered by. (This documentation stuff makes me poke into corners of
the kernel I ordinarily otherwise avoid, for various reasons. I don't
currently have the luxury of saying "beats me how this bit works, not my
area".)

> I already forgot the
> context of your previous inquiries. Alas the tone of them already did
> some damage, leading to responses like these.

Sorry about that. My social skills are finite, I tend to exhaust them when I
do too much at once. :(

The resulting documentation should be very polite and apolitical. :)

> > since
> > every non-embedded modern storage device I'm aware of has been consumed
> > by the SCSI layer (despite none of them actually having a discernably
> > closer relationship to SCSI than ATA did)
>
> The Linux SCSI subsystems don't consume, they provide services; nowadays
> not only for SCSI hardware and SCSI protocols but also for a number of
> subsystems whose tasks are similar enough to SCSI subsystems to make the
> SCSI core and upper SCSI layer useful to them too.

This discussion has clarified for me that my objection isn't the scsi layer
itself, it's the /dev/sd? namespace combining devices that would otherwise
be /dev/hda, /dev/nd0, /dev/ub0 (or usb0 or some such), and /dev/sata into a
single linear namespace that's unreliably ordered.

> BTW:
> | Now that IDE disks have been rerouted through the scsi layer, SATA goes
> | through the scsi layer, USB goes through the scsi layer, firewire goes
> | through the scsi layer...
>
> As a side note, SBP-2 is a SCSI transport protocol, hence ieee1394/sbp2
> and firewire/fw-sbp2 are Linux SCSI low-level drivers. Anything else
> would be just wrong and infeasible in this particular case.

My "scsi mid layer" vs "block layer" question was about whether I should read
up on the block layer if the scsi mid layer didn't use it. Neil Brown just
sent me a nice email (which I'll have to reread in the morning when I'm more
awake) that helps there.

The "ide/sata/usb/firewire->scsi" complaint didn't belong in the same email as
the original question, it's a line of questioning I put on hold on linux-scsi
back in August when the thread started getting a bit heated for my tastes.

To clarify, I think that merging ide, sata, usb, firewire, and others into a
single device namespace causes each type of device to inherit that
namespace's cumulative ordering issues, which is a bad thing. I have no real
attachment to the underlying scsi or block layers. I've never seriously
worked on either (although I'm trying to understand both).

For example, usb devices are never easy to order. IDE devices (back when they
had their own namespace) were trivial to order back when /dev/hda couldn't
move without use of a screwdriver. USB and IDE devices are very different in
that it's not possible to plug a USB device into an IDE controller (not
without one _heck_ of an adapter) and vice versa. USB devices usually live
outside the computer's case, and IDE devices inside the case. They're not
the same thing.

Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
IDE devices much harder than in the traditional "/dev/hdb doesn't move
without a screwdriver" model. The merger creates a new problem for IDE, one
which didn't exist before: the addition or removal of other unrelated types
of devices may change this device's location next boot. It may be possible
to add additional complication to the system to compensate, but what was the
advantage of merging the namespaces in the first place?
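
The renaming effect described above can be sketched with a toy model. This
is only an illustration, not how the kernel allocates names: the function
name assign_names and the per-bus prefixes "sat" and "ub" are made up for
the example, and probe order is simply given as a list.

```python
from collections import defaultdict
from string import ascii_lowercase


def assign_names(probe_order, shared=True):
    """Toy model: name disks in the order they are probed.

    probe_order is a list of (bus, label) pairs.  With shared=True every
    disk draws from one "sd" sequence; otherwise each bus keeps its own
    sequence (the "sat" and "ub" prefixes are hypothetical).
    """
    prefixes = {"sata": "sat", "usb": "ub"}  # hypothetical per-bus prefixes
    counters = defaultdict(int)
    names = {}
    for bus, label in probe_order:
        prefix = "sd" if shared else prefixes[bus]
        names[label] = "/dev/" + prefix + ascii_lowercase[counters[prefix]]
        counters[prefix] += 1
    return names


internal = ("sata", "internal disk")
usb_key = ("usb", "usb key")

# Same hardware, two boots, different probe order.
for order in ([internal, usb_key], [usb_key, internal]):
    print(assign_names(order, shared=True))
    print(assign_names(order, shared=False))
```

In the shared case the internal disk's name depends on whether the USB key
was probed first; with one sequence per bus it does not.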

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 09:52:46

by Rob Landley

Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Monday 15 October 2007 8:37:44 am Nick Piggin wrote:
> > Virtual memory isn't perfect. I've _always_ been able to come up with
> > examples where it just doesn't work for me. This doesn't mean VM
> > overcommit should be abolished, because it's useful more often than not.
>
> I hate to go completely offtopic here, but disks are so incredibly
> slow when compared to RAM that there is really nothing the kernel
> can do about this.

I know.

> Presumably the job will finish, given infinite
> time.

I gave it about half an hour, then it locked solid and stopped writing to the
disk at all. (I gave it another 5 minutes at that point, then held down the
power button.)

Lost about 50 open konqueror tabs...

> How much swap do you have configured?

2 gigs, same as ram.

> You really shouldn't configure
> so much unless you do want the kernel to actually use it all, right?

Two words: "Software suspend". I've actually been thinking of increasing it
on the next install...

> Because if we're not really conservative about OOM killing, then the
> user who actually really did want to use all the swap they configured
> gets angry when we kill their jobs without using it all.

I tend to lower "swappiness" and when that happens all sorts of stuff goes
weird. Software suspend used to say it can't free enough memory if I
put swappiness at 0 (dunno if it still does). This time the OOM killer never
triggered before hard deadlock. (I think I had it around 20 or 40 or some
such.)

> Would an oom-kill-someone-now sysrq be of help, I wonder?

*shrug* It might. I was letting it run hoping it would complete itself when
it locked solid. (The keyboard LEDs weren't flashing, so I don't _think_ it
panicked. I was in X so I wouldn't have seen a message...)

(To be honest, I can never remember how to trigger sysrq on a laptop keyboard.
Presumably X won't intercept it the way it does alt-f1 and ctrl-alt-del...)

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 09:58:41

by Nick Piggin

Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Monday 15 October 2007 19:52, Rob Landley wrote:
> On Monday 15 October 2007 8:37:44 am Nick Piggin wrote:
> > > Virtual memory isn't perfect. I've _always_ been able to come up with
> > > examples where it just doesn't work for me. This doesn't mean VM
> > > overcommit should be abolished, because it's useful more often than
> > > not.
> >
> > I hate to go completely offtopic here, but disks are so incredibly
> > slow when compared to RAM that there is really nothing the kernel
> > can do about this.
>
> I know.
>
> > Presumably the job will finish, given infinite
> > time.
>
> I gave it about half an hour, then it locked solid and stopped writing to
> the disk at all. (I gave it another 5 minutes at that point, then held
> down the power button.)

Maybe it was a bug then. Hard to say without backtraces ;)


> > You really shouldn't configure
> > so much unless you do want the kernel to actually use it all, right?
>
> Two words: "Software suspend". I've actually been thinking of increasing
> it on the next install...

Kernel doesn't know that you want to use it for suspend but not
regular swapping, unfortunately.


> > Because if we're not really conservative about OOM killing, then the
> > user who actually really did want to use all the swap they configured
> > gets angry when we kill their jobs without using it all.
>
> I tend to lower "swappiness" and when that happens all sorts of stuff goes
> weird. Software suspend used to say it can't free enough memory if I
> put swappiness at 0 (dunno if it still does). This time the OOM killer
> never triggered before hard deadlock. (I think I had it around 20 or 40 or
> some such.)
>
> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> *shrug* It might. I was letting it run hoping it would complete itself
> when it locked solid. (The keyboard LEDs weren't flashing, so I don't
> _think_ it panicked. I was in X so I wouldn't have seen a message...)

If you can work out where things are spinning/sleeping when that happens,
along with sysrq+M data, then it could make for a useful bug report. Not
entirely helpful, but if it is a reproducible problem for you, then you
might be able to get that data from outside X.
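
For getting the sysrq+M data without reaching the keyboard combination, a
minimal sketch follows. It assumes a kernel built with magic SysRq support
that exposes /proc/sysrq-trigger, and it assumes root privileges; the
helper name request_sysrq is made up for this example.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def request_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write a single SysRq command character to the trigger file.

    trigger_path is a parameter only so the helper can be tried against an
    ordinary file; on a real system it needs root and the default path.
    """
    if len(key) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
```

Calling request_sysrq("m") as root should ask the kernel to dump its memory
information to the kernel log, which can then be read from a text console
or over ssh rather than from inside X.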

2007-10-15 10:08:51

by Rob Landley

Subject: Re: What still uses the block layer?

On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:
> On 10/15/07, Rob Landley wrote:
> > I note that the eth0 and eth1 names are dynamically assigned on a first
> > come first serve basis (like scsi). This never causes me a problem
> > because the driver loading order is constant, and once you figure out
> > that eth0 is gigabit and eth1 is the 80211g it _stays_ that way across
> > reboots, reliably. Yeah, it's a heuristic. Hands up everybody relying on
> > such a heuristic in the real world.
>
> Umm, not quite, from my experiences with pre-production wireless
> drivers, (another story, another time) fancy stuff is being done in
> udev to make sure that your gigabit card is always assigned to eth0.

I remember building a 2.4 kernel, statically linking in all the drivers, and
getting the ethernet devices showing up in a reliable order for years. Where
does the need for fancy stuff come in?

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 10:32:47

by Loïc Grenié

Subject: Re: What still uses the block layer?

2007/10/15, Rob Landley:
> On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:
>> On Sun, Oct 14, 2007 at 06:45:44PM -0500, Rob Landley wrote:
>>> I admit a certain amount of personal annoyance that once the SCSI
>>> layer consumes a category of device (USB, SATA, PATA), they can
>>> often _only_ be used by going through the SCSI midlayer. (This
>>> strikes me as analogous to TCP/IP claiming ethernet and PPP devices
>>> so thoroughly that you can no longer address them as eth1 or
>>> /dev/ttyS0.)
>>
>> That's because modern USB, ATAPI (what was once known as IDE), SATA
>> really *all* using the SCSI command protocols at the low level,
>
> Ok, I'll bite. If it's all "real" scsi, why does ioctl(SG_EMULATED_HOST)
> exist?

How do you define real SCSI? The definition of SCSI in the kernel is
"a device that accepts the SCSI command set" (more precisely, "a
suitably large subset of the SCSI command set"). It looks as if your
definition of SCSI is "a device that is sold with SCSI written on the
box and that attaches to a card with SCSI written on the box"; is that
correct?

The host is the expansion card that connects the device to the
motherboard. If it is emulated this means that it is not a native
SCSI host. In case of USB drives/keys this is probably the case.

>> just as Ethernet and PPP interfaces really are fundamentally the
>> same thing.
>
> They're the same thing?
>
> Do you mean that on a system with both, going:
> ifconfig eth1 66.92.53.140
> ifconfig ppp 192.168.0.42
>
> Would be functionally equivalent to:
> ifconfig eth1 192.168.0.42
> ifconfig ppp 66.92.53.140
>
> So if on one boot the addresses are assigned the first way, and upon reboot
> they're assigned in the second way by exact the same set of commands... well
> that's not IMPORTANT, is it? (Or is it that everyone everywhere should use
> dhcp for everything, and static addressing is obsolete and no longer
> supported?

You are really looking like you are out for a fight.

> Apparently dhcp addresses should be delivered by machines with
> only one network interface of any type...)

I don't understand this one.

> This is my objection. Even when enumerating multiple devices of the same type
> is tricky, enumerating multiple devices of _different_ types should not be.
> There's a great big type indicator that is being _deliberately_ ignored, and
> large classes of devices (millions of laptops) where you know there's only
> going to be _one_ instance of a given type.

Your objection is interesting. It is lost in the middle of e-mails which,
to the untrained eye, look like you are trying to fight everyone and
everybody.

> By the way, ethernet cards contain a unique MAC address. Hard drives do not
> seem to, or if they do it's not being consistently exposed in a way I can
> find. This is sad. (No, reading data from the device to determine this gets
> us back to the "spinning up the external USB drive to find my root partition"
> gripe mentioned earlier.)

As far as I can tell the hard drives do not have serial numbers easily
readable by the kernel (I think it's only printed on the label). However
(feverishly plugging his USB key in the laptop), you can tell how a drive
is attached to the motherboard:

Laptop's SATA drive:
cognac $ readlink /sys/block/sda/device
../../devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0

USB key:
cognac $ readlink /sys/block/sdb/device
../../devices/pci0000:00/0000:00:13.5/usb6/6-3/6-3:1.0/host4/target4:0:0/4:0:0:0
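
The same check can be scripted. The sketch below is a rough heuristic, not
an official interface: it assumes the sysfs layout shown in the two
readlink outputs above, treats a path component like "usb6" as the sign of
a USB attachment, and reports everything else as "other" rather than
guessing. The function names are made up for this example.

```python
import os


def attachment_bus(sysfs_device_path):
    """Guess how a disk is attached from its sysfs device path.

    A path component such as "usb6" means the SCSI host sits under a USB
    controller; anything else is reported as "other".
    """
    for part in sysfs_device_path.split("/"):
        if part.startswith("usb") and part[3:].isdigit():
            return "usb"
    return "other"


def bus_of_disk(disk, sys_block="/sys/block"):
    """Resolve /sys/block/<disk>/device and classify the result."""
    link = os.path.join(sys_block, disk, "device")
    return attachment_bus(os.path.realpath(link))
```

With the two paths above, attachment_bus() classifies the sdb path as
"usb" and the sda path as "other".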

By the way, did you look in /dev/disk/by-id (udev magic)? It's probably
not very difficult to reconfigure udevd to not read the UUIDs of the
partitions and not spin up your holy external disk at each reboot. I think
the one that is spinning up your holy external hard drive is udevd. By
the way, how many times do you reboot instead of resuming from
suspend-to-disk? Have you tried TuxOnIce?
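
To see what udev has put in /dev/disk/by-id, a small sketch like this one
lists each persistent name next to the kernel device it currently points
at. It assumes udev populates that directory with symlinks; the function
name stable_names is made up, and the directory is a parameter only so the
code can be tried elsewhere.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent symlink name to the device it points at."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping
```

The keys stay the same across reboots as long as udev generates the same
identifiers, while the values (sda, sdb, ...) are free to move.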

If you had asked your first question in a way similar to this one:

"I have a laptop hard drive that shows up as different devices depending
on whether there are USB drives plugged in or not, what should I do?
Shouldn't SATA/USB/PATA/iSCSI drives be enumerated in different
queues?"

You would probably have received more interesting answers and fewer
insults.

>> You can rail against it, but that's the mark of someone who
>> refuses to accept reality.
>
> Let me clarify: I'm talking about device enumeration.
>
> I've never had trouble enumerating a device that was _not_ routed through the
> scsi layer, largely because the systems I work with don't usually have more
> than one device of the same type. (There are millions of laptop and desktop
> devices out there where this is the common case. As I said, I may have four
> USB ports and the ability to plug hubs into them, but you can't add another
> SATA hard drive to my laptop without a soldering iron.)
>
> However, as soon as a device _is_ routed through the scsi layer (as PATA was a
> few versions back), it gets conflated with numerous other devices. This
> creates problems. SATA isn't hard to enumerate in my laptop, USB potentially
> is. Dumping all the SATA devices into the same bucket with the USB devices
> makes both harder to enumerate.

Indeed. Propose a solution. Remember that it is indispensable that
everything goes through the SCSI layer(s): all those devices respond to the
same command set, so you do not want several implementations of
the routines.

>>> This has the annoying effect of bundling together different types of
>>> devices and making device enumeration unnecessarily difficult: my
>>> laptop only has one SATA hard drive and can't gain another without a
>>> soldering iron, but that drive could move from /dev/sda to /dev/sdb
>>> if I reboot the system with a USB key plugged in. This seems like a
>>> regrettable loss of orthogonality to me. I remember back when
>>> /dev/usb0 and /dev/hda were separate devices that showed up in /dev,
>>> but these days "it's SCSI" seems to trump "it's USB", "it's ATA", or
>>> "it's SATA". (Even though none of those are actually SCSI hardware,
>>> they just send a similar packet protocol across the wire.)
>>
>> You're showing your ignorance here.
>
> I have buckets of ignorance. It's why I ask questions.

Once again. You are so aggressive in your asking that it does not
lead to an interesting discussion.

>> In fact in the past few years,
>> ATA and SCSI has been converging significantly,
>
> And down far enough all these devices are powered by electricity. Are we
> going to wind up with /dev/electric[1-999]?

Out for a fight ?

> SATA != PATA != USB. But /dev/sda can be PATA, /dev/sdb SATA, and /dev/sdc
> USB. And they can move relative to each other. This didn't used to be the
> case. Why is it considered an improvement?

Every device is different from every other (they do not share their atoms).
Where do you want to draw the line between using a single driver for them
or not?

In that case the sd driver registers all disk-like devices that respond to
(a suitably large subset of) the SCSI command set. It is an improvement
to have all devices share a driver because if you improve it, you improve
it for all the devices; if you debug it, you debug it for all your
devices; and you use less memory.

The enumeration of the devices is not the nightmare you are trying to
imply. If it is hard in your particular case, many people are likely to want
to help you. Just try to ask politely, without shouting, without saying that
block layer is useless, without saying that device enumeration in the
SCSI layer (or sd device driver) is braindead. If you want to propose a
change, propose it: "we could also do it that other way". "The way the
SCSI disk is attached should show in the name of the device". I suspect
this will be refused anyway because that would mean that you need
a (series of) major block number(s) for each type of SCSI attachment
(if the device does not have a different major block number, nothing
short of udev can give it a different name).

>> with the ATAPI
>> specification essentially incorporating the SCSI protocol by
>> reference and by value --- to the point that SAS was developed by
>> the SCSI Trade Association, and SAS is effectively a superset of SATA,
>> to the point where with care, you can actually mix SAS and SATA drives
>> in the same enclosure (SAS and SATA are physically compatible at
>> the connector level).
>
> I'm aware of this, and under the impression they're both modified gigabit
> ethernet at the PHY level. Should the hard drive become eth2?

Out for a fight ?

>> More to the point, with SATA, hot plugging has been designed in, so
>> probing order is not going to be well defined,
>
> The spec may define the capability to hotplug, but your average
> laptop doesn't offer the capability to hotplug anything into its
> SATA controllers.

How long before eSATA enabled laptops (with eSATA enumerated
before SATA obviously) ?

> The hard drive is screwed in (due to the portability part of laptopness),
> all the controllers wired onto the motherboard are accounted for, none
> are exposed externally. What _is_ exposed externally is USB, and if you
> want to add an extra hard drive you can buy a cheap USB one at Fry's.
>
> In such a case, which is common, the first SATA hard drive is reliably the
> disk containing the root partition, and there's no need to stick a UUID
> in /etc/fstab.
>
> The problem is, "the first SATA hard drive" is not a stable identifier in a
> system where SATA and USB devices are dumped in the same bucket
> and given a big stir. Dumping SATA and USB devices into the same
> bucket (because they smell a bit like SCSI) is what I am objecting to.

You should have said so in the first place -- in a cooler tone.

>> just as with USB
>> devices. And there are already relatively common situations where the
>> same disk can show up via multiple different interfaces.
>
> It was also possible to buy a hotplug PATA ide enclosure. So what? The vast
> majority of traditional IDE users happily ignored this, and went on with
> their lives.
>
> > For example, if you have a modern Thinkpad with a secondary SATA hard
> > drive in an Ultrabay, and you plug it into the Ultrabay in your T60,
> > it will show up as a SATA drive.
>
> I remember the config option about enumerating onboard IDE controllers first.
> It didn't really matter what order they were enumerated in as long as it was
> controllable.
>
> Presumably if the primary SATA hard drive was /dev/sata and the slot
> with "secondary" in its name got /dev/satb, life would be good. And the
> presence or absence of /dev/satb wouldn't affect USB devices and such if they
> weren't in the same namespace.
>
>> However, if you plug it into the
>> Advanced dock, it shows up as a USB device.
>
> You plug it in somewhere else, it shows up somewhere else. This sounds
> familiar to old IDE users. :)
>
> How is it harder for udev to make a stable symlink for this drive that
> sometimes points to /dev/satb and sometimes to /dev/usb1? (Harder than a
> symlink that sometimes points to /dev/sdb and sometimes to /dev/sdd? You
> don't have persistent naming _now_, so the objection seems to be that
> maintaining the distinction between device types would not be a perfect
> solution in all cases. I agree. So?)
>
> > And with iSCSI not only
> > can you encapsulate a SCSI command stream over USB, you can do so over
> > IP as well.
>
> Yup. And you've been able to make a network block device for years. They
> showed up as /dev/nd0, a distinct type of block device which you (and your
> scripts) could find. Now yet another way of doing the same thing is mixed
> into the same scsi bucket and given a stir...
>
> > In any case, regardless of how the physical SATA drive is
> > attached to the system, you want it to show up as the same device and
> > be mounted in the same location.
>
> If my laptop's hard drive reliably showed up as /dev/sda every time, and I
> could count on that, I wouldn't be complaining about it. The entire problem
> is that it isn't guaranteed to do that, and thus /etc/fstab is a nightmare I
> can't edit.
>
> You could meet this definition of "the same" by having every block device in
> the system show up as /dev/block[a-z] no matter what type it was, and all the
> char devices show up as /dev/char[aa-zz], shuffle them all each reboot, and
> then have all the programs iterate through all of them any time they wanted
> something specific.
>
> I'm rather glad that /dev/ttyS0 and /dev/zero aren't easy to mix up.
>
> > That's why identifying filesystem by UUID's or Labels is so critical.
> > This is not a new concept; we've had the capability to do this for
> > over a decade, and I always knew it would be necessary for us to do
> > this sooner or later --- which is why I added the UUID support to ext2
> > back in 1996.
>
> It's necessary for IBM big iron to do this. It's generally not necessary for
> laptops or embedded systems to do this if they distinguish between _types_ of
> devices, which is something they until recently did for the types of devices
> I was interested in, and something they _stopped_ doing when everything got
> merged into the scsi layer, and I consider this a regression.
>
> No, distinguishing between types of devices is not a perfect solution to
> device enumeration, but it was sufficient for all my use cases for many
> years, and would still be if the kernel still did it, and I'm not alone here.
>
> > > The fact that udev can theoretically unwind this hairball is not an
> > > excuse for conflating different categories of devices in the first
> > > place.
> >
> > See the thinkpad Ultrabay drive example above.
>
> Last week I drove my laptop so deep into swap (with a "make -j" on qemu) that
> after half an hour trying to repaint my kmail window, it locked solid.
> Again. You'd think the oom killer would come to the rescue, but it didn't.
> Maybe Ubuntu disabled it. I have _2_gigs_ of ram in this sucker, on a stock
> Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times), and yet
> I managed to make it swap itself to death one more time.
>
> Virtual memory isn't perfect. I've _always_ been able to come up with
> examples where it just doesn't work for me. This doesn't mean VM overcommit
> should be abolished, because it's useful more often than not.
>
> So you have a counterexample. Ok. I can't actually see how your
> counterexample would be worse off than it is now; just differently worse off.
>
> > You address hosts by
> > IP address; it doesn't matter whether you access them via a PPP
> > interface, or a wireless interface, or a ethernet interface.
>
> It does when I'm configuring the interfaces.
>
> > Similarly, a disk could in theory be accessible over USB, SATA, or
> > iSCSI, and the Thinkpad example is only one such where the same
> > filesystem might be accessible over multiple interfaces. And with
> > multipath fiber channel SAN's (and I hate to break it to you, but FC
> > also uses SCSI protocols) storage is very much looking more and more
> > like networking.
>
> And in the networking world I'm able to say that this local machine has a
> static IP that is not world-routable. It is separate, manually configured, I
> put it _right_here_, and I personally know that it's not going to move
> because I'm the one who put it there and I'm the only one who would move it.
>
> Over on the networking side of things I can "ifconfig lo 127.0.0.1" without
> first probing all the interfaces to figure out which one's loopback and which
> one's the wireless card.
>
> I note that the eth0 and eth1 names are dynamically assigned on a first come
> first serve basis (like scsi). This never causes me a problem because the
> driver loading order is constant, and once you figure out that eth0 is
> gigabit and eth1 is the 80211g it _stays_ that way across reboots, reliably.
> Yeah, it's a heuristic. Hands up everybody relying on such a heuristic in
> the real world.

It's also easier because you have the MAC address to help.

> Possibly one solution here is to document that the SATA drivers load
> before any other scsi device (and the driver subsystem _waits_ for
> that to finish enumerating before trying any other kind of scsi device,
> with a barrier of some kind), and then any SATA devices present at
> boot time will reliably get those names in that order (no races, no
> variation) and anything after that is a separate problem. (Of course
> this would involve making it true if it currently isn't. It's still a mess to
> dump all sorts of different devices in the same namespace, but at
> least for the common case of a laptop with a SATA root partition this
> would let us get the UUID out of /etc/fstab).

There is also the need to ensure that each distribution loads the ahci
driver before any other SCSI or emulated SCSI host. It looks as if Ubuntu
does not do it that way (I've tried to understand the module probing logic
in the initramfs but have miserably failed; too much magic probably).

Remember that udevd starts your holy external hard disk, not the
UUID in /etc/fstab (obviously if you have a UUID in fstab, you need udev
to probe for it, but it'll do it regardless of whether or not a UUID is
present in fstab).

Regards,

Loïc

2007-10-15 11:20:42

by NeilBrown

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday October 15, [email protected] wrote:
>
> This is my objection. Even when enumerating multiple devices of the same type
> is tricky, enumerating multiple devices of _different_ types should not be.
> There's a great big type indicator that is being _deliberately_ ignored, and
> large classes of devices (millions of laptops) where you know there's only
> going to be _one_ instance of a given type.

My perspective is different.

The range of addressing options for "all disk devices" is far too rich
to be able to assign a stable device number to every device: there are
multiple, multi-dimensional addressing schemes, and some devices might
not even have a stable address at all (e.g. USB?).
So the reality of dealing with disk devices is that you cannot provide
a stable single-number naming scheme for all devices on all machines.

Therefore it is best to not have stable single-number naming schemes
for any devices on any machines. Why? Because it ensures there will
not be any second class citizens.

If some devices that are even reasonably common (e.g. IDE drives) are
stable, then some application developers or system integrators will
work under the assumption of stability and whatever they build will
break when you try it on different hardware. This happened during the
early days of SCSI support - code assumed the stability of
major/minor numbers and so did not work properly with SCSI which
cannot provide that stability in general.

Having a totally uniform approach makes development and testing a lot
easier - there are fewer special cases.

I would prefer that 'total uniformity' went even further than
/dev/sd?? to /dev/disk??. i.e. Anything that is or behaves
substantially like a disk drive should be "/dev/diskXX", where numbers
are assigned sequentially on discovery. (I wonder why we need
/dev/scdX to be separate from /dev/sdX).

Note that stable names are still a very real option. udev provides
several. /dev/disk/by-path/XXX will be stable for lots of "screwed
in" devices. /dev/disk/by-id will be stable for devices that report a
unique id. etc.
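
As a rough illustration of those udev-provided names, the sketch below
maps each stable symlink to the kernel node it currently points at. It
assumes the common layout where udev keeps these links under
/dev/disk/by-id (or by-path); the function name stable_names is made up
for this example and is not part of udev.

```python
import os


def stable_names(by_dir="/dev/disk/by-id"):
    """Map each udev symlink in by_dir to the device node it points at."""
    mapping = {}
    # The directory is an assumption about the distro's udev rules;
    # return an empty mapping if it is not there.
    if not os.path.isdir(by_dir):
        return mapping
    for name in sorted(os.listdir(by_dir)):
        link = os.path.join(by_dir, name)
        if os.path.islink(link):
            # realpath follows the link, e.g. to /dev/sda or /dev/sdb
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for name, node in stable_names().items():
        print(f"{name} -> {node}")
```

The left-hand names stay the same across reboots while the right-hand
/dev/sdX nodes may not, which is the point being made above.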

The difference between IDE, SATA, SCSI and even USB is peripheral for
the large majority of uses, and I think maintaining the distinction in
the major/minor number or in the "primary" /dev name is - for the
above reasons - more of a cost than a value.

NeilBrown

2007-10-15 11:41:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Mon, Oct 15, 2007 at 11:37:44PM +1000, Nick Piggin wrote:
> I hate to go completely offtopic here, but disks are so incredibly
> slow when compared to RAM that there is really nothing the kernel
> can do about this. Presumably the job will finish, given infinite
> time.

About 6 weeks ago, on a 2.6.23-rc kernel, I accidentally typed "make
-j", and left off the 4 before I hit the return key. About 2-3
minutes later, the box locked pretty tight. I managed to switch to a
VT console before I lost total control of X (the switch took many, many
minutes), and after many more minutes I managed to log in on
the console, but I wasn't able to get a ps command to complete so I
could start killing processes. (I probably should have just done a
"killall make" right away, but hindsight is 20/20.)

The console was showing that the OOM killer was attempting to kill
processes, but apparently not fast enough to stem the tide of all of
the new processes getting generated by the make -j. (I'm guessing
that it was killing the gcc processes and not the make processes.)

> Would an oom-kill-someone-now sysrq be of help, I wonder?

I tried sysrq-f (oom_kill), but no dice. Given that the oom killer
was active and apparently triggering on its own, this wasn't all that
surprising.

The interesting thing is I tried to do a sysrq-e (send SIGTERM to all
processes except init), waited 5 minutes or so, then tried an alt-sysrq-i
(send SIGKILL to all processes except init), and the system was still
thrashing itself to death, even after giving it plenty of time to try
to recover.

I finally gave up and held down the power button. This was on a box
with 4 gigs memory (but only 3 gigs visible thanks to a cheap
BIOS/chipset) and 4 gigs swap (mainly intended for suspend/resume).

I chalked it up to me being stupid (I should have noticed and
Ctrl-C'ed the make -j much more quickly, or if I were a sysadmin on a
time-sharing system with users I didn't trust, configured RLIMIT_NPROC
and/or per-user container resource limits) and the OOM killer not
being aggressive enough in such a situation. But having better things
to do, I didn't go whining on LKML about it, although I have to say
that the kernel behavior isn't exactly ideal. One of these days when
I have time, I'll try investigating it with a few memlocked processes
running at real-time priorities and Systemtap and figure out what the
heck was going on....
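
For what the RLIMIT_NPROC idea might look like in practice, here is a
minimal sketch that runs a command under a lowered per-user process
limit. The helper names (clamp_soft_limit, run_with_nproc_limit) and the
value 200 are invented for the example; note that the limit counts all of
the user's processes and is generally not enforced for root.

```python
import resource
import subprocess


def clamp_soft_limit(requested, hard):
    """Return a soft limit that does not exceed the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def run_with_nproc_limit(cmd, max_procs):
    """Run cmd with RLIMIT_NPROC lowered to at most max_procs."""
    def lower_limit():
        # Runs in the child between fork() and exec(); only the soft
        # limit is changed, so the command could raise it again.
        _, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        soft = clamp_soft_limit(max_procs, hard)
        resource.setrlimit(resource.RLIMIT_NPROC, (soft, hard))
    return subprocess.run(cmd, preexec_fn=lower_limit)


if __name__ == "__main__":
    # A runaway "make -j" would start failing to fork once the user
    # owns about 200 processes, instead of growing without bound.
    run_with_nproc_limit(["make", "-j"], 200)
```

This only caps the number of processes, not their memory use, so it is a
guard against fork storms rather than a general cure for thrashing.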

I suppose I should just configure suspending to a file instead of a
swap partition, but I've just historically trusted suspend/resume to a
swap partition much more than to a file. Or maybe I should hack in a
sysctl to prevent any swapping even though the swap partition is
configured (so only suspend/resume will use it).

- Ted

2007-10-15 13:05:48

by Alan

[permalink] [raw]
Subject: Re: What still uses the block layer?

> For the desktop I don't object to the scsi layer. I object to the naming.
> Merging a half-dozen different types of devices into a single name space, and

They *are* SCSI devices. USB storage is a SCSI over USB transport. ATAPI
is a SCSI over ATA transport. SAS is much the same thing, as is FC, and
it continues.

With the exception of ATA disk, for historical reasons, SCSI essentially
won the battle of command formats.

> problems to the point where common cases (like my laptop) aren't impacted by
> them during early boot. I don't think anybody (outside the embedded space)
> is actually upset that /dev/hda now goes through the scsi layer: they're
> upset Ubuntu 7.04 no longer calls it /dev/hda.

For the embedded CF using world we could do with a truly dumb ATA only CF
driver, possibly even with pure polled support that used neither the IDE
or the ATA layer.

Alan

2007-10-15 13:11:00

by James Bottomley

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Sun, 2007-10-14 at 18:45 -0500, Rob Landley wrote:
> On Sunday 14 October 2007 5:24:32 pm James Bottomley wrote:
> > On Sat, 2007-10-13 at 16:05 -0600, Matthew Wilcox wrote:
> > > On Thu, Oct 11, 2007 at 08:11:21PM -0500, Rob Landley wrote:
> > > > My impression from asking questions on the linux-scsi mailing list is
> > > > that the scsi upper/middle/lower layers doesn't use the block layer
> > > > described in Documentation/block/*.
> > >
> > > Entirely incorrect.
> >
> > OK, right ... could we please get a sense of decorum back on this list.
>
> Did I reply to the insult?
>
> > Rob, if you didn't ask your alleged questions in such a pejorative
> > manner, we'd get a lot further
>
> I'm not attempting to be pejorative.

OK, so could we get back to the original discussion? The question I
think you meant to ask is "does SCSI use the block layer, and if so;
how?"

The answer is yes (just do an ls /sys/block on any scsi machine). The
how is that it basically uses the block layer as a service library (i.e.
most SCSI services are built on top of those already provided by block).
The email you cited was basically from our one area of confusion: SCSI
and block both provide services to decode the SG_IO ioctl. This is
partly historical; block and SCSI are very much intertwined; so much so
that they both tend to drive each other's development. The programme
over the last few years has been to identify features in SCSI that
should be more generic (and hence moved to block). SG_IO is one of
these, so we end up with the situation where Block provides this as a
service (and sr, st and sd make use of it) while the sg driver still
doesn't use what the block layer provides but rolls its own. I think
the layout of how all this works is illustrated at a reasonably high
level here on slide 15:

http://licensing.steeleye.com/support/papers/ols_2005_slides.pdf
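
The "ls /sys/block" check can be taken one step further with a small
script. This is only a sketch: it assumes sysfs is mounted at /sys and
that each disk entry has a "device" link to the underlying hardware
(entries such as ram or loop devices have none); block_devices is a
made-up helper name.

```python
import os


def block_devices(sys_block="/sys/block"):
    """Return {device name: resolved sysfs path} for each block device."""
    devices = {}
    if not os.path.isdir(sys_block):
        return devices
    for name in sorted(os.listdir(sys_block)):
        entry = os.path.join(sys_block, name)
        # Follow the "device" link when there is one; otherwise fall
        # back to the entry itself (e.g. for ram or loop devices).
        link = os.path.join(entry, "device")
        target = link if os.path.islink(link) else entry
        devices[name] = os.path.realpath(target)
    return devices


if __name__ == "__main__":
    for name, path in block_devices().items():
        print(f"{name}: {path}")
```

On a machine with SCSI-attached disks the resolved path for sda and
friends typically runs through a SCSI host and target, while the device
itself is still listed under /sys/block like any other block device.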


> I admit a certain amount of personal annoyance that once the SCSI layer
> consumes a category of device (USB, SATA, PATA), they can often _only_ be
> used by going through the SCSI midlayer. (This strikes me as analogous to
> TCP/IP claiming ethernet and PPP devices so thoroughly that you can no longer
> address them as eth1 or /dev/ttyS0.)

OK. But that's the bit I need you to separate from your inquiry into
how SCSI actually works. You can't go on a research trip if you allow
preconceived notions to spill over into it.

For the record, USB and firewire are SCSI at their core, so they can
never really be separated. SATA (but not SATAPI) is a separate
protocol, so it can theoretically be separated later, and we are
actually working on that. It's only in SCSI because there's a well
defined and standardised way to place it there (called the SAT
layer---SCSI to ATA Translation) and because it's a lot easier since
SCSI has all the features and quite a few of the necessary ones aren't
yet migrated to block.

> This has the annoying effect of bundling together different types of devices
> and making device enumeration unnecessarily difficult: my laptop only has one
> SATA hard drive and can't gain another without a soldering iron, but that
> drive could move from /dev/sda to /dev/sdb if I reboot the system with a USB
> key plugged in. This seems like a regrettable loss of orthogonality to me.
> I remember back when /dev/usb0 and /dev/hda were separate devices that showed
> up in /dev, but these days "it's SCSI" seems to trump "it's USB", "it's ATA",
> or "it's SATA". (Even though none of those are actually SCSI hardware, they
> just send a similar packet protocol across the wire.)
>
> The fact that udev can theoretically unwind this hairball is not an excuse for
> conflating different categories of devices in the first place. Avoiding an
> unnecessary problem seems superior to trying to get udev to solve it. Note
> that Ubuntu 7.04 solves it by sticking a UUID on every _partition_, and then
> spinning up my external USB hard drive trying to find the root partition on a
> reboot. Tell me how this can be considered progress:
>
> > # /etc/fstab: static file system information.
> > #
> > # <file system> <mount point> <type> <options> <dump> <pass>
> > proc /proc proc defaults 0 0
> > # /dev/sda1
> > UUID=04d1b984-bd65-46f1-9a77-c158cf4bed1b / ext3 defaults,errors=remount-ro,noatime 0 1
> > # /dev/sda5
> > UUID=cdf0936d-9f19-42c6-b131-9fefcf1321ef none swap sw 0 0
> > /dev/scd0 /media/cdrom0 udf,iso9660 user,noauto 0 0
> > UUID=86bbb512-ab7e-4a12-8618-1190f032c082 /boot ext3 defaults 0 0
>
> Conflating categories of hardware that cannot easily be enumerated (USB) with
> categories that can (the SATA hard drive in my laptop, of which there can be
> only one) strikes me as a bad thing. Putting them in a common "scsi device
> pool" within which they do not enumerate consistently is not something I
> enjoy dealing with.

However, by design choice, we got the SCSI layer in the kernel out of
the business of trying to provide a stable name space, since Richard
Gooch did a brilliant job of demonstrating the insolubility of that
problem. There are many ways to identify a device (UUID being just one
of them). It seems much more desirable to give the users the choice.
You can even have what you seem to want (SATA stably at /dev/sda) simply
by ensuring that you have a modular kernel and that libata always loads
before USB or any other storage device (not that I'd recommend doing
this, because it will fail for a large configuration, but it would work
for you).

> However, the response to my attempts to express this dissatisfaction on the
> SCSI list a few months ago came too close to a flamewar for me to consider
> continuing it productive. I'd still love to update the "2.4 scsi howto" and
> corresponding sg howto, but lack the expertise. The SCSI layer really isn't
> my area, and I was much happier back when I could avoid using it at all.

That was because your initial inquiry came across as "I'm trying to
document this, and by the way it's rubbish". By all means have an
inquiry and an argument, but saying effectively I don't understand this
but I know it's wrong is a guaranteed way to antagonise everyone who's
worked to try to make all of this as functional as possible. Find out
the facts first then argue from them.

> The question I was trying to ask _here_ was about the block layer. I seem not
> to have asked it very well. Sorry 'bout that.

OK, so look at the diagram and the other SCSI documents and come back
for further clarification as you need it.

James


2007-10-15 13:21:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 03:04:00AM -0500, Rob Landley wrote:
> Ok, I'll bite. If it's all "real" scsi, why does ioctl(SG_EMULATED_HOST)
> exist?

SG_EMULATED_HOST was added before Linux 2.4, at least six or seven
years ago. Back then the migration of ATA devices through the various
versions of the ATAPI specification and then into SATA was very early in
its evolution, and back then, yes there were people who considered
anything that didn't use the honking huge parallel SCSI cables not
"real" SCSI. Over time, that distinction at both the physical
connector level and logical level has declined to the point of almost
non-existence. It's not quite at the point where SAS exists only to
justify massive price differences between commodity and "data-center
grade" disks to the benefit of hard drive manufacturers, but it's
darned close. (There are differences such as voltage levels so that
the max cable lengths for SAS are longer, etc., but those could
have been optional additions to the SATA spec, and SAS
drives are supposedly manufactured to be more robust --- although some
recent papers published at FAST have raised some interesting questions
about how true those marketing claims really are in practice.)

> > just
> > as Ethernet and PPP interfaces really are fundamentally the same
> > thing.
>
> They're the same thing?
>
> Do you mean that on a system with both, going:
> ifconfig eth1 66.92.53.140
> ifconfig ppp 192.168.0.42
>
> Would be functionally equivalent to:
> ifconfig eth1 192.168.0.42
> ifconfig ppp 66.92.53.140

No, of course not. But we don't have separate IP stacks for ethernet
and ppp devices. And how we connect to a host via ssh makes no
difference whether we accessed it via Ethernet or PPP. And I would
argue that how we address a filesystem should also make no difference
depending on the path to hard drive.

> By the way, ethernet cards contain a unique MAC address. Hard
> drives do not seem to, or if they do it's not being consistently
> exposed in a way I can find.

You can pull a Model and Serial number via hdparm -i, but it's not as
easy to manipulate as a fixed-length MAC address. That's why people
tend to use filesystem UUID's.

> > More to the point, with SATA, hot plugging has been designed in, so
> > probing order is not going to be well defined,
>
> The spec may define the capability to hotplug, but your average
> laptop doesn't offer the capability to hotplug anything into its
> SATA controllers. The hard drive is screwed in (due to the
> portability part of laptopness), all the controllers wired onto the
> motherboard are accounted for, none are exposed externally. What
> _is_ exposed externally is USB, and if you want to add an extra hard
> drive you can buy a cheap USB one at Fry's.

That may be true for laptops today, but Linux doesn't run just on
laptops. You can easily get home servers with hot-swap SATA bays. My
home fileserver, which is a white box I purchased on my own nickel,
NOT IBM big iron, has 3TB of raw storage and cost less than $10,000 a
year ago. Today, that amount of home storage with hot-swap SATA drives and
a battery-backed hardware RAID controller could probably be purchased
for about half that price.

And even for laptops, if you need the performance, you can get Cardbus
cards that will allow you to connect eSATA drives to your laptop at
Fry's.

So even if you ignore "big data center" interconnects like FC, this
problem exists even for commodity grade SATA devices.

I agree at the moment we have an issue where if the root device isn't
guaranteed, it forces people to use initrd's, and the quality and
debuggability of initrd's between distro's is highly variable and not
standardized. In practice though the /dev/sda is actually pretty
stable on laptops, especially if you end up compiling ehci and uhci
support as modules (which is a good idea from a power savings point of
view anyway). The reason why Ubuntu and other distributions are using
UUID-based labels is not just because of the root device, but also for
all of the other disks that might be mounted on the system, including
some that might be using USB devices that don't have stable /dev
names.
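
To make the UUID-based approach concrete, the following sketch reads
fstab-style lines and resolves UUID= or LABEL= entries to whatever device
node they currently correspond to. It assumes udev maintains links under
/dev/disk/by-uuid and /dev/disk/by-label; parse_fstab and resolve_spec
are illustrative names, not part of mount or udev.

```python
import os


def parse_fstab(text):
    """Yield (spec, mount_point, fs_type) for each non-comment line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3:
            yield fields[0], fields[1], fields[2]


def resolve_spec(spec, dev_disk="/dev/disk"):
    """Resolve UUID=/LABEL= specs to a device node; pass others through."""
    for prefix, subdir in (("UUID=", "by-uuid"), ("LABEL=", "by-label")):
        if spec.startswith(prefix):
            link = os.path.join(dev_disk, subdir, spec[len(prefix):])
            # None means no such filesystem is attached right now.
            return os.path.realpath(link) if os.path.islink(link) else None
    return spec


if __name__ == "__main__":
    with open("/etc/fstab") as f:
        for spec, mount_point, fs_type in parse_fstab(f.read()):
            print(mount_point, fs_type, spec, "->", resolve_spec(spec))
```

The same UUID resolves to the right filesystem whether the disk shows up
as /dev/sda1 on one boot or /dev/sdb1 on the next.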

> It's necessary for IBM big iron to do this. It's generally not
> necessary for laptops or embedded systems to do this if they
> distinguish between _types_ of devices, which is something they
> until recently did for the types of devices I was interested in, and
> something they _stopped_ doing when everything got merged into the
> scsi layer, and I consider this a regression.

As another example, it's easy to see a home media server running Linux
which doesn't have any expansion bays for additional hard drive --- so
the only way a user could expand their storage is by using one or more
permanently connected USB disks. So we do need to solve the general
device enumeration problem in the general case; it's not just the case
of IBM "big iron" as you seem to think.

> No, distinguishing between types of devices is not a perfect
> solution to device enumeration, but it was sufficient for all my use
> cases for many years, and would still be if the kernel still did it,
> and I'm not alone here.

News flash! The kernel wasn't built just for you, and over time, more
and more people will have multiple disk drives of the same type, so we
will need to solve the device naming problem sooner or later. Why not
solve it sooner, especially given that a number of the companies (not just
IBM) funding the organization that is paying *your* salary are
interested in solving the general case?

Furthermore, I've already pointed out a number of situations where the
home user might have multiple USB devices on their system today, and
that is probably going to go up over time, not down. Have you seen
how cheap 500GB USB disks are at Costco? And for a typical
unsophisticated user, plugging in another 500G USB disk when they need
more storage is a lot easier than cracking open the computer case and
futzing with screws and disk cables and power connectors.

- Ted

2007-10-15 13:26:44

by Alan

[permalink] [raw]
Subject: Re: What still uses the block layer?

> You can pull a Model and Serial number via hdparm -i, but it's not as
> easy to manipulate as a fixed-length MAC address. That's why people
> tend to use filesystem UUID's.

ATA8 at the moment looks set to add a true "MAC" or "WWN" type identifier
to each device. Right now model/serial is not always unique.

2007-10-15 13:36:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 02:29:45PM +0100, Alan Cox wrote:
> > You can pull a Model and Serial number via hdparm -i, but it's not as
> > easy to manipulate as a fixed-length MAC address. That's why people
> > tend to use filesystem UUID's.
>
> ATA8 at the moment looks set to add a true "MAC" or "WWN" type identifier
> to each device.. Right now model/serial is not always unique.

True, but most manufacturers try to make the serial number unique for
their own reasons (like warranty service), and you can have
manufacturing errors with MAC assignment just as easily as you can
with serial numbers.

I still remember when SGI shipped MIT 20 SGI Indy pizza boxes that all
had the same MAC addresses (that we knew about --- we found out
because all 20 were installed on the same subnet). That was a mildly
entertaining bug to track down.... especially since IIRC, Irix at the
time didn't print warning messages when someone else with a different
IP address responded to your MAC address.

- Ted

2007-10-15 14:01:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007 03:36:15 -0500
Rob Landley wrote:

> The point I was trying to make is that it seems to me like it would
> be possible to keep the namespace separate here, and thus reduce the
> enumeration problems to the point where common cases (like my laptop)
> aren't impacted by them during early boot. I don't think anybody
> (outside the embedded space) is actually upset that /dev/hda now goes
> through the scsi layer: they're upset Ubuntu 7.04 no longer calls
> it /dev/hda.

that's a choice Ubuntu made in their udev scripts... if you don't like
it, complain to them.
I'm surprised you would even need to care about what device name things
are though.... with mount-by-label (deployed for a bunch of years now
in most distros), and various helpful links like /dev/cdrom ....

anyway.. if you don't like your distro's udev configuration, lkml is the
wrong forum.

2007-10-15 14:46:47

by Douglas Gilbert

[permalink] [raw]
Subject: Re: What still uses the block layer?

Theodore Tso wrote:
> On Mon, Oct 15, 2007 at 03:04:00AM -0500, Rob Landley wrote:
>> Ok, I'll bite. If it's all "real" scsi, why does ioctl(SG_EMULATED_HOST)
>> exist?
>
> SG_EMULATED_HOST was added before Linux 2.4, at least six or seven
> years ago.

SG_EMULATED_HOST was present when I started maintaining the
sg driver in 1997. Back then some folks (one German name
comes to mind) toyed with the idea of sending SCSI Parallel
Interface (SPI) messages through a pass through interface.
SPI messages are obviously transport specific and hence any
app trying to send them needed to ascertain what the transport
was. There were really only two to choose from at the time
(in linux): SPI and the ATA Packet Interface (ATAPI).

If SG_EMULATED_HOST was ever used I'm not sure. It is just
an historical remnant now.

> Back then the migration of ATA devices through the various
> versions the ATAPI specification and then into SATA was very early in
> its evolution, and back then, yes there were people who considered
> anything that didn't use the honking huge parallel SCSI cables not
> "real" SCSI. Over time, that distinction at both the physical
> connector level and logical level has declined to the point of almost
> non-existence.

On the contrary, the distinction between the logical
(command) level and the transport level (down to the
physical/connector level) is pivotal. There is one
industry accepted storage architecture (SAM (yes, ATA
documents defer to it)), two command sets: ATA and SCSI
(and ways to tunnel one within the other and translate
between the two) and about 10 transports (interconnects)
that I can think of.

Comparisons between PATA and SCSI (SPI) are now history.
More precise terminology is now required.
For example the "ATAPI specification" IMO is a handful
of ATA commands designed to convey a packet based protocol
(which the rest of the ATA command set is not). So ATAPI
could be used to send IP over ATA! Is that what you meant?

> It's not quite at the point where SAS exists only to
> justify massive price differences between commodity and "data-center
> grade" disks to the benefit of hard drive manufacturers, but it's
> darned close. (There are differences such as voltage levels so that
> the max cable differences for SAS are larger, etc., but those could
> have been optional additions to the SATA spec, and allegedly SAS
> drives are supposedly manufactured to be more robust --- although some
> recent papers published at FAST have raised some interesting questions
> about how true those marketing claims really are in practice.)

You should read more about SAS.

Anyway Seagate have announced a ES.2 family of 3.5" disks
that rotate at 7200 rpm. One would not normally expect disks
below 10000 rpm to come with a SCSI transport (FCP, SAS or
SPI) but the ES.2 series breaks the pattern since it
comes with either a SATA or a SAS interface. What will be
really interesting is how Seagate will price the two versions.
Apart from the SAS variant having dual ports it is pretty
close to an apples versus apples comparison.

A port selector could be added to the SATA variant to provide
dual port functionality. However the SCSI command set offers
persistent reservations which are beyond the scope of ATA
command sets which assume a logical point to point connection.

Doug Gilbert

2007-10-15 16:08:18

by Matthew Wilcox

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 04:26:04AM -0500, Rob Landley wrote:
> For example, usb devices are never easy to order. IDE devices (back when they
> had their own namespace) were trivial to order back when /dev/hda couldn't
> move without use of a screwdriver.

Ah, but it could. If you had more than one IDE controller (which is
even possible on laptops; the Fujitsu P7120 is one that I'm familiar
with that has more than one), the initialisation order *of the
controllers* would change which was hda and which was hde.
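
To make the ordering point concrete, here is a minimal Python sketch (not
kernel code) of a simplified model in which each IDE channel gets two
letters, handed out in the order the controllers initialise. The function
name name_ide_disks, the controller names and the disk labels are all made
up for illustration, and the two-channels-per-controller layout is an
assumption.

```python
def name_ide_disks(controllers):
    """Assign hdX names in controller initialisation order.

    controllers: list of (controller_name, channels), where channels is a
    list of (master, slave) tuples and an empty slot is None.
    Every slot consumes a letter, whether or not a disk is present.
    """
    names = {}
    slot = 0
    for _ctrl, channels in controllers:
        for master, slave in channels:
            for disk in (master, slave):
                if disk is not None:
                    names[disk] = "hd" + chr(ord("a") + slot)
                slot += 1
    return names


# Hypothetical machine: two controllers, two channels each, one disk apiece.
onboard = ("onboard", [("internal", None), (None, None)])
second = ("second", [("other", None), (None, None)])

print(name_ide_disks([onboard, second]))  # internal -> hda, other -> hde
print(name_ide_disks([second, onboard]))  # other -> hda, internal -> hde
```

No screwdriver is involved in the second call: only the initialisation
order differs.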

> Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
> IDE devices much harder than in the traditional "/dev/hdb doesn't move
> without a screwdriver" model. The merger creates a new problem for IDE, one
> which didn't exist before: the addition or removal of other unrelated types
> of devices may change this device's location next boot. It may be possible
> to add additional complication to the system to compensate, but what was the
> advantage of merging the namespaces in the first place?

It's not something anyone particularly set out to do, it's just how
it worked out. It was justified by saying "ok, this goes from a 99%
solution to a 96% solution, but there's a 100% solution called uuids".
I don't particularly agree with this line of argumentation, but it did
hold sway.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-10-15 17:11:54

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

Matthew Wilcox wrote:
> On Mon, Oct 15, 2007 at 04:26:04AM -0500, Rob Landley wrote:
>> Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
>> IDE devices much harder than in the traditional "/dev/hdb doesn't move
>> without a screwdriver" model. The merger creates a new problem for IDE, one
>> which didn't exist before: the addition or removal of other unrelated types
>> of devices may change this device's location next boot. It may be possible
>> to add additional complication to the system to compensate, but what was the
>> advantage of merging the namespaces in the first place?
>
> It's not something anyone particularly set out to do, it's just how
> it worked out. It was justified by saying "ok, this goes from a 99%
> solution to a 96% solution, but there's a 100% solution called uuids".
> I don't particularly agree with this line of argumentation, but it did
> hold sway.

Low-level networking drivers suggest a default interface name (per
interface or as a template like eth%d into which the networking core
inserts a lowest spare number). Userspace can rename interfaces, but
nevertheless it's nice to have different default kernel names for
ethernet, wlan etc..

Could low-level SCSI drivers provide similar name templates which give a
hint on the transport involved? It's a bit more difficult than with
networking interfaces though because
- SCSI devices can have sd, sr, st, osst, ch, sg interfaces,
- SCSI device files share a namespace with all other device files.

E.g.
/dev/sd-ide-b - second IDE HDD,
/dev/sd-iscsi-e - fifth iSCSI direct access device,
/dev/sr-sata-0 - first SATA CD-ROM,
/dev/sr-usb-0 - a USB CD-ROM,
/dev/st-fw-0 - a FireWire tape drive,
/dev/sda - a device whose transport driver didn't propose a name

Of course the really interesting names will still be provided by
udev-generated symlinks.
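
A rough Python sketch of the "template plus lowest spare number" idea, in
case it helps the discussion. This is only an illustration: alloc_name is a
made-up helper, not the networking core's code, and the sr-sata-%d template
is just the hypothetical naming from the list above. It covers numbered
names only, not the lettered sd-ide-b style.

```python
def alloc_name(template, in_use):
    """Fill an 'eth%d'-style template with the lowest number not in use.

    template: a name containing exactly one '%d', or a fixed name.
    in_use:   a set of names that are already taken.
    """
    if "%d" not in template:
        return template  # fixed name suggested by the driver
    n = 0
    while (template % n) in in_use:
        n += 1
    return template % n


taken = {"eth0", "eth2"}
print(alloc_name("eth%d", taken))        # eth1 is the lowest spare number
print(alloc_name("sr-sata-%d", taken))   # sr-sata-0
```

Note that the result still depends on which names happen to be taken at the
time, so a template gives a hint about the transport, not a stable name.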
--
Stefan Richter
-=====-=-=== =-=- -====
http://arcgraph.de/sr/

2007-10-15 17:26:54

by Greg KH

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 03:36:15AM -0500, Rob Landley wrote:
>
> The point I was trying to make is that it seems to me like it would be
> possible to keep the namespace separate here, and thus reduce the enumeration
> problems to the point where common cases (like my laptop) aren't impacted by
> them during early boot.

Proposals on how to do this would be gladly reviewed.

But again, please remember that these USB devices are really SCSI
devices. Same for SATA devices. There is a reason they are using the
SCSI layer, and it isn't just because the developers felt like it :)

> I don't think anybody (outside the embedded space) is actually upset
> that /dev/hda now goes through the scsi layer: they're upset Ubuntu
> 7.04 no longer calls it /dev/hda.

Use mount-by-label instead, it's much saner and handles device name
movement just fine (as does the UUID method that you seem to hate.)
Look in /dev/disk/ for a wide range of options that you have in which to
choose how to pick your block devices.
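
For anyone who wants to see which of those names point at a given device,
here is a small Python sketch. It assumes the usual udev layout of
symlinks under /dev/disk/<scheme>/; persistent_names is a made-up helper
name, and /dev/sda1 in the usage line is only an example.

```python
import os


def persistent_names(device, root="/dev/disk"):
    """Return every symlink under root/<scheme>/ that resolves to device."""
    target = os.path.realpath(device)
    found = []
    if not os.path.isdir(root):
        return found
    for scheme in sorted(os.listdir(root)):
        scheme_dir = os.path.join(root, scheme)
        if not os.path.isdir(scheme_dir):
            continue
        for entry in sorted(os.listdir(scheme_dir)):
            link = os.path.join(scheme_dir, entry)
            if os.path.islink(link) and os.path.realpath(link) == target:
                found.append(link)
    return found


if __name__ == "__main__":
    # Example device node; substitute whatever your root partition is.
    for name in persistent_names("/dev/sda1"):
        print(name)
```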

Oh, and this seems like a very Ubuntu specific rant, might I suggest you
contact the Ubuntu developers about this? The kernel doesn't dictate
that the distro has to use these long identifiers, and there is nothing
we can do about it.

good luck,

greg k-h

2007-10-15 17:34:33

by Greg KH

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 05:08:36AM -0500, Rob Landley wrote:
> On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:
> > On 10/15/07, Rob Landley <[email protected]> wrote:
> > > I note that the eth0 and eth1 names are dynamically assigned on a first
> > > come first serve basis (like scsi). This never causes me a problem
> > > because the driver loading order is constant, and once you figure out
> > > that eth0 is gigabit and eth1 is the 80211g it _stays_ that way across
> > > reboots, reliably. Yeah, it's a heuristic. Hands up everybody relying on
> > > such a heuristic in the real world.
> >
> > Umm, not quite, from my experiences with pre-production wireless
> > drivers, (another story, another time) fancy stuff is being done in
> > udev to make sure that your gigabit card is always assigned to eth0.
>
> I remember building a 2.4 kernel, statically linking in all the drivers, and
> getting the ethernet devices showing up in a reliable order for years. Where
> does the need for fancy stuff come in?

Because PCI devices reorder their bus numbers all the time. And we have
ethernet devices hanging off of USB connections now (yes, even built-in
to the machine), and we have network connections on other hot-pluggable
busses (remember, PCI is hot pluggable.)

So, the distros need to name network devices in a persistent way, that
is why the distros now do this. If you don't like the distro doing it,
complain to them, it's not a kernel issue :)

thanks,

greg k-h

2007-10-15 17:45:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: What still uses the block layer?

Alan Cox wrote:
>> You can pull a Model and Serial number via hdparm -i, but it's not as
>> easy to manipulate as a fixed-length MAC address. That's why people
>> tend to use filesystem UUID's.
>
> ATA8 at the moment looks set to add a true "MAC" or "WWN" type identifier
> to each device.. Right now model/serial is not always unique.

WWN was added in ATA-7, AFAICS.

However, I've seen quite a few ATA-7 devices that do not bother to fill
it in. I wonder if ATA-8 device firmwares will act with similar
slackness. :)

Jeff



2007-10-15 18:01:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 10:25:13AM -0700, Greg KH wrote:
> Use mount-by-label instead, it's much saner and handles device name
> movement just fine (as does the UUID method that you seem to hate.)
> Look in /dev/disk/ for a wide range of options that you have in which to
> choose how to pick your block devices.

But you still have to spin up the disc to read the label (which seems
like a legitimate complaint to me).

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-10-15 18:50:24

by Jeff Garzik

[permalink] [raw]
Subject: Re: What still uses the block layer?

Greg KH wrote:
> On Mon, Oct 15, 2007 at 03:36:15AM -0500, Rob Landley wrote:
>> The point I was trying to make is that it seems to me like it would be
>> possible to keep the namespace separate here, and thus reduce the enumeration
>> problems to the point where common cases (like my laptop) aren't impacted by
>> them during early boot.
>
> Proposals on how to do this would be gladly reviewed.

Agreed.


> But again, please remember that these USB devices are really SCSI
> devices. Same for SATA devices. There is a reason they are using the
> SCSI layer, and it isn't just because the developers felt like it :)

/somewhat/ true I'm afraid: libata uses the SCSI layer for ATAPI
devices because they are essentially bridges to SCSI devices. It uses
the SCSI layer for ATA devices because the SCSI layer provided a huge
amount of infrastructure that would need to have been otherwise
duplicated, /then/ massaged into coordinating between <jgarzik's ATA
layer> and <SCSI layer> when dealing with ATAPI.

There is also a detail that was of /huge/ value when introducing a new
device class: distro installers automatically work, if you use SCSI.
If you use a new block device type, that behaves differently from other
types and is on a different major, you have to poke the distros into
action or do it yourself.

IOW, it was the high Just Works(tm) value of the SCSI layer when it came
to ATA (not ATAPI) devices.

For the future, ATA will eventually be more independent (though the SCSI
simulator will be available as an option, for compat), but the value is
big enough to put that task on the back-burner.

Jeff



2007-10-15 18:57:22

by Matthew Garrett

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 07:00:22AM -0700, Arjan van de Ven wrote:

> that's a choice Ubuntu made in their udev scripts... if you don't like
> it, complain to them.

Keeping the naming as hda while changing the semantics (such as the
reduced number of partitions) would have been differently confusing. We
did look into keeping compatibility symlinks, but decided to just
transition everything to UUIDs instead.

--
Matthew Garrett | [email protected]

2007-10-15 20:40:58

by Wilfried Klaebe

[permalink] [raw]
Subject: Re: What still uses the block layer?

Am Mon, Oct 15, 2007 at 04:26:04AM -0500 schrieb Rob Landley:
> To clarify, I think that merging ide, sata, usb, firewire, and others into a
> single device namespace causes each type of device to inherit that
> namespace's cumulative ordering issues, which is a bad thing. I have no real
> attachment to the underlying scsi or block layers. I've never seriously
> worked on either (although I'm trying to understand both).
>
> For example, usb devices are never easy to order. IDE devices (back when they
> had their own namespace) were trivial to order back when /dev/hda couldn't
> move without use of a screwdriver. USB and IDE devices are very different in
> that it's not possible to plug a USB device into an IDE controller (not
> without one _heck_ of an adapter) and vice versa. USB devices usually live
> outside the computer's case, and IDE devices inside the case. They're not
> the same thing.
>
> Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
> IDE devices much harder than in the traditional "/dev/hdb doesn't move
> without a screwdriver" model.

I have udev here, and it generates several useful symlinks.
/dev/disk/by-path/pci-0000:00:1f.1-scsi-0:0:0:0-part2 will always point
to the second primary partition of the IDE master on the first IDE
channel here, be there as many USB sticks as there may.

(But still I'd like it if it wasn't named "scsi-0:0:0:0", because the
"0:0:0:0" part could change.)
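
To show what is encoded in such a name, here is a small Python sketch that
splits it up. It is an illustration only: parse_by_path is a made-up helper,
the pattern covers just the pci-...-scsi-... form shown above, and reading
the four numbers as host:channel:target:lun is my assumption based on the
usual SCSI addressing.

```python
import re

# PCI address, then the four SCSI numbers, then an optional partition.
BY_PATH = re.compile(
    r"^pci-(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"-scsi-(?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+)"
    r"(?:-part(?P<part>\d+))?$"
)


def parse_by_path(name):
    """Split a by-path link name into its fields, or return None."""
    match = BY_PATH.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"pci": fields["pci"]}
    for key in ("host", "channel", "target", "lun"):
        result[key] = int(fields[key])
    # part is None for the whole-disk link
    result["part"] = int(fields["part"]) if fields["part"] is not None else None
    return result


print(parse_by_path("pci-0000:00:1f.1-scsi-0:0:0:0-part2"))
```

The PCI address is the part tied to the physical slot; the numbers after
"scsi-" are the part I would rather not depend on.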

> The merger creates a new problem for IDE, one
> which didn't exist before: the addition or removal of other unrelated types
> of devices may change this device's location next boot. It may be possible
> to add additional complication to the system to compensate, but what was the
> advantage of merging the namespaces in the first place?

I don't think there was any intent to merge namespaces. It "just happened"
as a byproduct of having sata/pata use the scsi subsystem.

Wilfried
--
Irgendwas ist ja immer...

2007-10-15 21:35:20

by Rob Landley

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday 15 October 2007 5:32:32 am Loïc Grenié wrote:
> You are really looking like you are out for a fight.
...
> Your objection is interesting. It is lost in the middle of e-mails
> which, to the untrained eye, look like you are trying to fight everyone and
> everybody.
...
> ...holy external disk...
> ...holy external hard...
...
> You would probably have received more interesting answers and less
> insults.
...
> Once again. You are so aggressive in your asking that it does not
> lead to an interesting discussion.
...
> Out for a fight ?

This is where I hit my ad hominem attack quota and stopped reading.

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 21:35:40

by Rob Landley

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday 15 October 2007 6:19:58 am Neil Brown wrote:
> On Monday October 15, [email protected] wrote:
> > This is my objection. Even when enumerating multiple devices of the same
> > type is tricky, enumerating multiple devices of _different_ types should
> > not be. There's a great big type indicator that is being _deliberately_
> > ignored, and large classes of devices (millions of laptops) where you
> > know there's only going to be _one_ instance of a given type.
>
> My perspective is different.
>
> The range of addressing option for "all disk devices" is far too rich
> to be able to assign a stable device number to every device: there are
> multiple, multi-dimensional addressing schemes, and some devices might
> not even have a stable address at all (e.g. USB?).
> So the reality of dealing with disk devices is that you cannot provide
> a stable single-number naming scheme for all devices on all machines.

Sure.

> Therefore it is best to not have stable single-number naming schemes
> for any devices on any machines. Why? Because it ensures there will
> not be any second class citizens.

This is where we disagree. The existence of devices you cannot stably
enumerate does not eliminate the existence of devices you trivially can.

Pulling out the "IBM numa cluster with multiple SAS enclosures _and_ firewire"
infrastructure to find the root partition on my hard drive may be good for
the IBM numa clusters, but only at the expense of complicating this part of
my laptop's infrastructure by an order of magnitude, and making embedded
systems nearly impossible to put together. If "one size fits all" were true,
my cell phone would be running Red Hat Enterprise.

> If some devices that are even reasonably common (e.g. IDE drives) are
> stable, then some application developers or system integrators will
> work under the assumption of stability and whatever they build will
> break when you try it on different hardware.

So you break the IDE drives to get laptop users to debug the Niagara set? The
solution is to make the easy cases hard?

> This happened during the
> early days of SCSI support - code assumed the stability of
> major/minor numbers and so did not work properly with SCSI which
> cannot provide that stability in general.

In this case, I ripped the relevant infrastructure out by hand so fstab again
has /dev/sda. I can do it again on future systems. I'd just really rather
not have to.

> Having a totally uniform approach makes development and testing a lot
> easier - there are fewer special cases.

There are actually more special cases, you just expose more people to them.

> I would prefer that 'total uniformity' went even further than
> /dev/sd?? to /dev/disk??. i.e. Anything that is or behaves
> substantially like a disk drive should be "/dev/diskXX", where numbers
> are assigned sequentially on discovery. (I wonder why we need
> /dev/scdX to be separate from /dev/sdX).

It's /dev/srX here, and I have no idea.

I believe merging these namespaces invents problems, and was a bad idea. I
understand your reasoning, but imposing the problems of mainframes onto
laptops does not strike me as an improvement for laptops.

> Note that stable names are still a very real option. udev provides
> several. /dev/disk-by-path/XXX will be stable for lots of "screwed
> in" devices. /dev/disk-by-id will be stable for devices that report a
> unique id. etc.

Here it's

ls /dev/disk/by-path/
pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-0:0:0:0-part4
pci-0000:00:1f.2-scsi-0:0:0:0-part1 pci-0000:00:1f.2-scsi-0:0:0:0-part5
pci-0000:00:1f.2-scsi-0:0:0:0-part2 pci-0000:00:1f.2-scsi-0:0:0:0-part6
pci-0000:00:1f.2-scsi-0:0:0:0-part3 pci-0000:00:1f.2-scsi-1:0:0:0

And this is an improvement?

> The difference between IDE, SATA, SCSI and even USB is peripheral for
> the large majority of uses, and I think maintaining the distinction in
> the major/minor number or in the "primary" /dev name is - for the
> above reasons - more of a cost than a value.

Is your definition of "the large majority of uses" where ncr Voyager, the
Amiga, and current macintosh laptops are all one use each, or is your
definition of "the large majority of uses" the one where each "use" is an
installation, of which there are millions of PCs (and even more ARM cell
phones), and something like three instances of Voyager?

I realize that both views are valid. This is why the US has a house and a
senate, and filters things through both views. My gripe is that forcing my
laptop to look at my USB devices to find my SATA hard drive is aligned with
only one of those viewpoints, and completely opposed to the other.

An approach that makes things much easier on laptops is seen to hurt big iron,
not because the approach itself has a direct negative impact on big iron,
but only because then laptops are not saddled with the problems of big iron.

Why do you allow uni-processor kernel builds then?

> NeilBrown

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 21:46:44

by Jeff Garzik

[permalink] [raw]
Subject: Re: What still uses the block layer?

Rob Landley wrote:
> I realize that both views are valid. This is why the US has a house and a
> senate, and filters things through both views. My gripe is that forcing my
> laptop to look at my USB devices to find my SATA hard drive is aligned with
> only one of those viewpoints, and completely opposed to the other.
>
> An approach that makes things much easier on laptops is seen to hurt big iron,
> not because the approach itself has a direct negative impact on big iron,
> but only because then laptops are not saddled with the problems of big iron.


And we are telling you that, in a modern hotplug world -- yes even on a
laptop -- you are clinging too much to assumptions that were never 100%
true in the first place, and much less so on today's laptops.

When you can unplug a SATA drive from a laptop, and plug it back in via
USB, you can see how unwise it is to hardcode device names into your fstab.

We invented udev, sysfs, mount-by-label, mount-by-uuid, LVM and all
sorts of other gadgets to make this problem go away.

If you ignore the solutions that exist to solve these problems, then of
course annoyances will persist as the state of hardware marches forward.

Jeff


2007-10-15 21:53:28

by Rob Landley

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday 15 October 2007 8:10:49 am James Bottomley wrote:
> OK, so could we get back to the original discussion? The question I
> think you meant to ask is "does SCSI use the block layer, and if so;
> how?"
>
> The answer is yes (just do an ls /sys/block on any scsi machine). The
> how is that it basically uses the block layer as a service library (i.e.
> most SCSI services are built on top of those already provided by block).
> The email you cited was basically from our one area of confusion: SCSI
> and block both provide services to decode the SG_IO ioctl. This is
> partly historical; block and SCSI are very much intertwined; so much so
> that they both tend to drive each other's development. The programme
> over the last few years has been to identify features in SCSI that
> should be more generic (and hence moved to block). SG_IO is one of
> these, so we end up with the situation where Block provides this as a
> service (and sr, st and sd make use of it) while the sg driver still
> doesn't use what the block layer provides but rolls its own. I think
> the layout of how all this works is illustrated at a reasonably high
> level here on slide 15:
>
> http://licensing.steeleye.com/support/papers/ols_2005_slides.pdf

Thanks, that's exactly what I wanted to know.

> > However, the response to my attempts to express this dissatisfaction on
> > the SCSI list a few months ago came too close to a flamewar for me to
> > consider continuing it productive. I'd still love to update the "2.4
> > scsi howto" and corresponding sg howto, but lack the expertise. The SCSI
> > layer really isn't my area, and I was much happier back when I could
> > avoid using it at all.
>
> That was because your initial inquiry came across as "I'm trying to
> document this, and by the way it's rubbish".

Sorry about that. Not my intent. I was aiming more at "I'm trying to
document this and I don't understand how it works at all, or why it does
things this way. It seems backwards from what I would expect."

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 21:59:06

by Alan

[permalink] [raw]
Subject: Re: What still uses the block layer?

> This is where we disagree. The existence of devices you cannot stably
> enumerate does not eliminate the existence of devices you trivially can.

"trivially"

You are I assume familiar in full with EDD 3.0, EDD 1.x and the Ralf
Brown documentation on the BIOS drive mappings and tables for different
BIOSes ?

If you are then you could add EDD 1.x support, FADT parsing and update the
EDD driver to produce links to the drives in BIOS map order. Would be
quite useful but very few people on the planet actually know all the
arcana to do this.

Alan

2007-10-15 22:54:28

by Rob Landley

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday 15 October 2007 12:25:13 pm Greg KH wrote:
> Oh, and this seems like a very Ubuntu specific rant, might I suggest you
> contact the Ubuntu developers about this? The kernel doesn't dictate
> that the distro has to use these long identifiers, and there is nothing
> we can do about it.

I was just trying to use the strangeness in a large distributor's first
attempt at this functionality as evidence that it's apparently not trivial
to get even the common cases right under the new model, while the common
cases used to be trivial to get right under the old model. (Or at least it
seemed so to me.)

I think I've exhausted this line of argument, though, and will stop now.

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-15 23:41:38

by NeilBrown

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Monday October 15, [email protected] wrote:
> > Therefore it is best to not have stable single-number naming schemes
> > for any devices on any machines. Why? Because it ensures there will
> > not be any second class citizens.
>
> This is where we disagree. The existence of devices you cannot stably
> enumerate does not eliminate the existence of devices you trivially can.

No, but it dramatically reduces the value of being able to enumerate
those devices.

>
> Pulling out the "IBM numa cluster with multiple SAS enclosures _and_ firewire"
> infrastructure to find the root partition on my hard drive may be good for
> the IBM numa clusters, but only at the expense of complicating this part of
> my laptop's infrastructure by an order of magnitude, and making embedded
> systems nearly impossible to put together. If "one size fits all" were true,
> my cell phone would be running Red Hat Enterprise.
>
> > If some devices that are even reasonably common (e.g. IDE drives) are
> > stable, then some application developers or system integrators will
> > work under the assumption of stability and whatever they build will
> > break when you try it on different hardware.
>
> So you break the IDE drives to get laptop users to debug the Niagara set? The

Breaking old behaviour is always bad... My computers with IDE
interfaces still see stable "/dev/hda" devices. Are you saying the
devices that used to be "hda" are now "sdb" ?? Maybe there is a
.config option...

> solution is to make the easy cases hard?

Is it really that hard?

> > Note that stable names are still a very real option. udev provides
> > several. /dev/disk-by-path/XXX will be stable for lots of "screwed
> > in" devices. /dev/disk-by-id will be stable for devices that report a
> > unique id. etc.
>
> Here it's
>
> ls /dev/disk/by-path/
> pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-0:0:0:0-part4
> pci-0000:00:1f.2-scsi-0:0:0:0-part1 pci-0000:00:1f.2-scsi-0:0:0:0-part5
> pci-0000:00:1f.2-scsi-0:0:0:0-part2 pci-0000:00:1f.2-scsi-0:0:0:0-part6
> pci-0000:00:1f.2-scsi-0:0:0:0-part3 pci-0000:00:1f.2-scsi-1:0:0:0
>
> And this is an improvement?

Depends on your metric.

"Easy to type" - I guess /dev/hda1 wins hands down.
"Can be used in a script or config file and is guaranteed always to
work until a screwdriver is used to change that device or its
controller"
I think
/dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0-part1
is quite acceptable.
What is your metric?


>
> > The difference between IDE, SATA, SCSI and even USB is peripheral for
> > the large majority of uses, and I think maintaining the distinction in
> > the major/minor number or in the "primary" /dev name is - for the
> > above reasons - more of a cost than a value.
>
> Is your definition of "the large majority of uses" where ncr Voyager, the
> Amiga, and current macintosh laptops are all one use each, or is your
> definition of "the large majority of uses" the one where each "use" is an
> installation, of which there are millions of PCs (and even more ARM cell
> phones), and something like three instances of Voyager?

My definition of "the large majority of uses" is "mkfs, fsck, mount,
fdisk, system-install-process".

Different people differentiate devices in different ways. A system
integrator might know about the hardware path. An end user might know
about drive brands or sizes. A casual user might just think "internal
or external". The kernel cannot support all these different
approaches to naming. It really is best if it uses arbitrary names,
and provides access to descriptions that the user can choose between.
udev facilitates this with links in /dev/disk/. A system install can
facilitate this even more by reporting size/manufacturer information etc.

>
> I realize that both views are valid. This is why the US has a house and a
> senate, and filters things through both views. My gripe is that forcing my
> laptop to look at my USB devices to find my SATA hard drive is aligned with
> only one of those viewpoints, and completely opposed to the other.

I'm guessing you are talking about mount-by-uuid? This effectively has
to look at the filesystem of all devices to discover which one has the
correct UUID, though it can cache the information for efficiency.

Maybe it is just an implementation issue. Suppose that every time a
device were discovered, it were examined to see what was stored on it,
and this information was stored in a cache.
Then to find a particular filesystem to mount, you just look in the
cache and if the info isn't there yet, just wait or fail as
appropriate.
Then we don't "look at my USB devices to find my SATA hard drive" but
rather "look at each device as it is attached to find out what is in
it", which seems like a sensible thing to do...
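
A very rough Python sketch of that idea, just to show its shape. Nothing
here is existing code: FilesystemCache and its methods are invented for
illustration, the UUID string is a placeholder, and how a device would
actually be examined on attach is left out entirely.

```python
import threading


class FilesystemCache:
    """Remember what was found on each device when it was attached."""

    def __init__(self):
        self._by_uuid = {}
        self._changed = threading.Condition()

    def device_added(self, device, uuid):
        # Called once per device at discovery time, with whatever
        # filesystem UUID was found on it.
        with self._changed:
            self._by_uuid[uuid] = device
            self._changed.notify_all()

    def find(self, uuid, timeout=0.0):
        # timeout=0.0 means "fail now"; a larger value means "wait a bit".
        with self._changed:
            self._changed.wait_for(lambda: uuid in self._by_uuid,
                                   timeout=timeout)
            return self._by_uuid.get(uuid)


cache = FilesystemCache()
print(cache.find("placeholder-uuid"))           # None: nothing seen yet
cache.device_added("/dev/sda1", "placeholder-uuid")
print(cache.find("placeholder-uuid"))           # /dev/sda1
```

The lookup never touches any other device; it only consults what was
recorded as each device appeared.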

>
> An approach that makes things much easier on laptops is seen to hurt big iron,
> not because the approach itself has a direct negative impact on big iron,
> but only because then laptops are not saddled with the problems of big iron.

I think your "laptops vs big iron" contrast is making the gap seem
bigger than it really is. Naming issues are present in laptops and
easily get significant in modest servers.

>
> Why do you allow uni-processor kernel builds then?

Funny you should suggest that...
I don't think OpenSuSE10.3 includes any UP kernels. There is code in
the kernel which detects the single processor case and removes some
of the more expensive "LOCK" operations to reduce the cost of using an SMP
kernel on a UP computer.
There is real value in reducing the number of options, and people have
obviously put work into making that a cost-effective proposition.

NeilBrown

2007-10-15 23:49:37

by Julian Calaby

[permalink] [raw]
Subject: Re: What still uses the block layer?

[adding back CCs which were dropped because I'm stupid - sorry!]

On 10/16/07, Rob Landley <[email protected]> wrote:
> On Monday 15 October 2007 5:27:55 am Julian Calaby wrote:
> > On 10/15/07, Rob Landley <[email protected]> wrote:
> > > On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:
> > > > On 10/15/07, Rob Landley <[email protected]> wrote:
> > > > > I note that the eth0 and eth1 names are dynamically assigned on a
> > > > > first come first serve basis (like scsi). This never causes me a
> > > > > problem because the driver loading order is constant, and once you
> > > > > figure out that eth0 is gigabit and eth1 is the 80211g it _stays_
> > > > > that way across reboots, reliably. Yeah, it's a heuristic. Hands up
> > > > > everybody relying on such a heuristic in the real world.
> > > >
> > > > Umm, not quite, from my experiences with pre-production wireless
> > > > drivers, (another story, another time) fancy stuff is being done in
> > > > udev to make sure that your gigabit card is always assigned to eth0.
> > >
> > > I remember building a 2.4 kernel, statically linking in all the drivers,
> > > and getting the ethernet devices showing up in a reliable order for
> > > years. Where does the need for fancy stuff come in?
> >
> > I remember that too. In fact, I have had no issues with network card
> > enumeration order, outside my own inexperience and stupidity.
> >
> > However, this sort of thing is needed now because of the various types
> > of hotpluggable networking devices, e.g. USB 802.11 cards, USB
> > ethernet cards, PCMCIA, etc.
>
> I thought the strategy was just to scan the hotpluggable busses after the
> non-hotpluggable busses.

My (practical) experience is that I couldn't guarantee which card was
which. (I remember once where it changed over a kernel re-compile) So
my solution, before Debian's persistent naming scheme appeared, was to
check it after every new kernel and make sure my config matched up
with the names of the physical interfaces.

> > And yes, PCMCIA worked fine for ages, but
> > usually you'd never have more than one PCMCIA network card.
>
> Still don't, but presumably the slots are scanned in a reliable order so if
> the cards are always present on bootup in the same slots, they'd stay in that
> order.

Well, yes and no. My gut feeling is that they're probed like PCI cards
are: they're initialised when the drivers are loaded, and not before.
As such, there are no guarantees which card will be initialised first.
And anyway, what happens if you plug them in in a different order?

> > Personally, I use 2 different usb network cards, and I'm quite
> > comforted to know that the 802.11a one is always wlan0, and the
> > 802.11b/g one is always wlan1.
>
> So if I have a USB 100baseT adapter, and I boot with it plugged in, it'll
> potentially come before my built-in wireless card due to ordering based on
> device type?

Ok, firstly the 100baseT adapter will be named something like ethX,
the wireless card will most likely be named something like wlanX.

Now let's say your laptop has a built in ethernet card.

So, we'll assume a modular kernel, with the module "usbnet" for the
usb card and "e100" for the onboard card:

If the "usbnet" module is loaded first, then initially, according to
the kernel, the usb card will be eth0 and the built in one eth1.

Now let's assume that, on the PCI bus, the USB controller is in a
lower slot number than the network card. (highly likely, given that
the network card is most likely external to the chipset of the laptop)
It's pretty likely that the USB controller will have its module
loaded first, before the built in network card. At this point, it'll
send out hotplug events for all its children (root hubs, etc.) and
eventually an event will be sent out for the usb network card. Now, at
this point, it's impossible to say which one will claim eth0 first.

Now, in my case, with my two wireless cards, what happens if I plug
the 802.11b/g one in first? If this fancy renaming didn't happen, it'd
end up with the name wlan0 and, hence, try to connect to the network
which the 802.11a one is supposed to connect to.

This is not a good thing.
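
To make the difference concrete, here is a small Python sketch of the two
behaviours. It only models the effect; it is not how udev or the kernel
implements renaming, and the function names and MAC addresses are invented
for the example.

```python
def kernel_names(probe_order, prefix="wlan"):
    """First come, first served: the Nth device probed gets prefix + N."""
    return {mac: "%s%d" % (prefix, i) for i, mac in enumerate(probe_order)}


def persistent_names(probe_order, rules, prefix="wlan"):
    """Apply MAC -> name rules; a device without a rule gets the lowest
    number that is not already reserved by a rule or handed out."""
    reserved = set(rules.values())
    names = {}
    next_free = 0
    for mac in probe_order:
        if mac in rules:
            names[mac] = rules[mac]
            continue
        while "%s%d" % (prefix, next_free) in reserved:
            next_free += 1
        names[mac] = "%s%d" % (prefix, next_free)
        reserved.add(names[mac])
    return names


# Hypothetical MAC addresses for the 802.11a and 802.11b/g cards.
CARD_A = "00:11:22:33:44:aa"
CARD_BG = "00:11:22:33:44:bb"
RULES = {CARD_A: "wlan0", CARD_BG: "wlan1"}

# Plugging the b/g card in first flips the kernel's names...
print(kernel_names([CARD_BG, CARD_A]))
# ...but the rule-based names stay put whatever the plugging order.
print(persistent_names([CARD_BG, CARD_A], RULES))
```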

I also have to make the point that this has been happening all over
the kernel, well before I started using it. Video4Linux and DVB
devices can be USB, and the order the /dev/videoX nodes appear in is
determined by the plugging order. IRDA cards, sound cards, usb
devices, framebuffers, mice, keyboards, loopback devices, etc. all
have the same "issue". (and annoyingly, they all have different ways
of getting around it, or not)

And to make one final point, getting right back to the initial parts
of the discussion, at the end of the day, your SATA disk, IDE disk,
USB disk and the CF card in your camera are all mass storage devices -
they all work in a fairly similar way. You want to mount filesystems
from all of them, and when you run low level tools, like parted or
whatever, you want them all to behave in the same way. If the kernel
abstracts away the nastiness of talking SATA, IDE, USB mass storage,
or CF - and hence makes them all behave in the same, standard way -
why the hell not?

Thanks,

--

Julian Calaby

Email: [email protected]

2007-10-16 02:08:55

by David Lang

Subject: Re: What still uses the block layer?

On Tue, 16 Oct 2007, Neil Brown wrote:

> On Monday October 15, [email protected] wrote:
>>> Therefore it is best to not have stable single-number naming schemes
>>> for any devices on any machines. Why? Because it ensure there will
>>> not be any second class citizens.
>>
>> This is where we disagree. The existence of devices you cannot stably
>> enumerate does not eliminate the existence of devices you trivially can.
>
> No, but it dramatically reduces that value of being able to enumerate
> those devices.

this is the point of disagreement. the devices you can trivially enumerate
can be handled easily and trivially, the ones that you can't may require
more complex things to handle them, but that depends on the situation. If
you only have one USB drive on a system you don't need to worry about what
order USB hotplug events come in if you can just say 'the first USB
drive'. mixing the different types of devices into one namespace
complicates things in a couple of ways.

1. devices that used to have stable names no longer have stable names
without extra effort.

2. having multiple separate unstable namespaces with one name in each of
them looks to the user like a stable namespace, since the instability
never comes into play. combining these into a single namespace loses
this stability
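
a toy model of the two points above, in Python. the function names and the
per-type prefixes ("hd", "ud") are placeholders picked for the example, not
a claim about what any driver actually uses; the model only shows how a
shared namespace lets an unrelated device shift an existing name.

```python
import string


def merged_namespace(detected):
    """One shared sdX namespace: letters are handed out in detection order."""
    return {dev: "sd" + string.ascii_lowercase[i]
            for i, (dev, _kind) in enumerate(detected)}


def separate_namespaces(detected, prefixes):
    """One namespace per device type: each type counts from 'a' on its own."""
    counters = {}
    names = {}
    for dev, kind in detected:
        index = counters.get(kind, 0)
        counters[kind] = index + 1
        names[dev] = prefixes[kind] + string.ascii_lowercase[index]
    return names


PREFIXES = {"ide": "hd", "usb": "ud"}  # illustrative prefixes only

boot_without_key = [("internal disk", "ide")]
boot_with_key = [("usb key", "usb"), ("internal disk", "ide")]

for boot in (boot_without_key, boot_with_key):
    print(merged_namespace(boot), separate_namespaces(boot, PREFIXES))
```

with one device per type the separate namespaces never move, while the
merged one renames the internal disk as soon as the usb key is detected
ahead of it.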

>>
>> Pulling out the "IBM numa cluster with multiple SAS enclosures _and_ firewire"
>> infrastructure to find the root partition on my hard drive may be good for
>> the IBM numa clusters, but only at the expense of complicating this part of
>> my laptop's infrastructure by an order of magnitude, and making embedded
>> systems nearly impossible to put together. If "one size fits all" were true,
>> my cell phone would be running Red Hat Enterprise.
>>
>>> If some devices that are even reasonably common (e.g. IDE drives) are
>>> stable, then some application developers or system integrators will
>>> work under the assumption of stability and whatever they build will
>>> break when you try it on different hardware.
>>
>> So you break the IDE drives to get laptop users to debug the Niagra set? The
>
> Breaking old behaviour is always bad... My computers with IDE
> interfaces still see stable "/dev/hda" devices. Are you saying the
> devices that used to be "hda" are now "sdb" ?? Maybe there is a
> .config option...

yes, this changed. If you run your IDE drives with the PATA drivers of
libata they show up as sdX, and are subject to the same detection order
issues as any other sd device.

>> solution is to make the easy cases hard?
>
> Is it really that hard?
>
>>> Note that stable names a still a very real option. udev provides
>>> several. /dev/disk-by-path/XXX will be stable for lots of "screwed
>>> in" devices. /dev/disk-by-id will be stable for devices the report a
>>> unique id. etc.
>>
>> Here it's
>>
>> ls /dev/disk/by-path/
>> pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-0:0:0:0-part4
>> pci-0000:00:1f.2-scsi-0:0:0:0-part1 pci-0000:00:1f.2-scsi-0:0:0:0-part5
>> pci-0000:00:1f.2-scsi-0:0:0:0-part2 pci-0000:00:1f.2-scsi-0:0:0:0-part6
>> pci-0000:00:1f.2-scsi-0:0:0:0-part3 pci-0000:00:1f.2-scsi-1:0:0:0
>>
>> And this is an improvement?
>
> Depends on your metric.
>
> "Easy to type" - I guess /dev/hda1 wins hands down.
> "Can be used in a script or config file and is guaranteed always to
> work until a screwdriver is used to change that device or it's
> controller"
> I think
> /dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0-part1
> is quite acceptable.
> What is your metric?

does it have to be one or the other? /dev/hda1 succeeded on both metrics.
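
for anyone who wants to see what the long names map to on their own
machine, a short Python sketch follows. it assumes udev has populated
/dev/disk/by-path with symlinks (as in the listing quoted above); the
function name stable_links is made up for the example, and the function
returns an empty result if the directory is missing.

```python
import os


def stable_links(directory="/dev/disk/by-path"):
    """Return {stable link name: kernel device path} for one udev directory."""
    links = {}
    if not os.path.isdir(directory):
        return links  # udev not running, or no such directory on this system
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            # Resolve the symlink to the kernel's current name, e.g. /dev/sda1.
            links[name] = os.path.realpath(path)
    return links


if __name__ == "__main__":
    for name, target in stable_links().items():
        print("%-50s -> %s" % (name, target))
```

the same function can be pointed at /dev/disk/by-id or /dev/disk/by-uuid by
passing a different directory.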


>>> The different between IDE, SATA, SCSI and even USB is peripheral for
>>> the large majority of uses, and I think maintaining the distinction in
>>> the major/minor number or in the "primary" /dev name is - for the
>>> above reasons - more of a cost that a value.
>>
>> Is your definition of "the large majority of uses" where ncr Voyager, the
>> Amiga, and current macintosh laptops are all one use each, or is your
>> definition of "the large majority of uses" the one where each "use" is an
>> installation, of which there are millions of PCs (and even more ARM cell
>> phones), and something like three instances of Voyager?
>
> My definition of "the large majority or uses" is "mkfs, fsck, mount,
> fdisk, system-install-process".
>
> Different people differentiate devices in different ways. A system
> integrator might know about the hardware path. An end user might know
> about drive brands or sizes. A casual user might just think "internal
> or external". The kernel cannot support all these different
> approaches to naming. It really is best if it uses arbitrary names,
> and provides access to descriptions that the user can choose between.
> udev facilitates this with links in /dev/disk/. A system install can
> facilitate this even more by reporting size/manufacturer information etc.

but is the possibility of wanting different options really sufficient
reason to eliminate every stable option? right now the /dev names are
essentially random without external help. why couldn't they be stable (in
all cases where that is possible) and let people who are happy with the
defaults not run the external helpers, but leave them as options for
people who do want things to be different?

>>
>> I realize that both views are valid. This is why the US has a house and a
>> senate, and filters things through both views. My gripe is that forcing my
>> laptop to look at my USB devices to find my SATA hard drive is aligned with
>> only one of those viewpoints, and completely opposed to the other.
>
> I'm guessing you are talking about mount-by-uuid? This effectively has
> to look at the filesystem of all devices to discover which one has the
> correct UUID, though it can cache the information for efficiency.
>
> Maybe it is just an implementation issue. Suppose that everytime a
> device were discovered, it were examined to see what was stored on it,
> and this information was stored in a cache.
> Then to find a particular filesystem to mount, you just look in the
> cache and if the info isn't there yet, just wait or fail as
> appropriate.
> Then we don't "look at my USB devices to find my SATA hard drive" but
> rather "look at each device as it is attached to find out what is in
> it", which seems like a sensible thing to do...

this would still require spinning up every drive and looking at it to find
the UUID.

>>
>> An approach that makes things much easier on laptops is seen to hurt big iron,
>> not because it the approach itself has a direct negative impact on big iron,
>> but only because then laptops are not saddled with the problems of big iron.
>
> I think your "laptops vs big iron" contrast is making the gap seem
> bigger than it really is. Naming issues are present in laptops and
> easily get significant is modest servers.

maybe it's because I've been using linux for so long (since before 1.0),
but I have not been seeing the same thing. it's possible that none of the
several hundred servers I've built and managed have been big enough to
have the problems that you describe, but the recent 'fixes' for these
problems have been more painful for me than the original problems.

yes I have had kernel upgrades that changed the link order of drivers and
I've had to deal with that, but I still have that problem today, with udev
and friends involved. I recently was installing linux onto machines with
multiple SCSI controllers and had all sorts of fun because the install
disk detection order wasn't the same as the installed kernel detection
order, causing the installer to decide the wrong drive was the boot drive
and put the boot loader in the wrong place (and this happened for multiple
distros). To get things working I finally did the install, then dug up my
old slackware boot disks to get into the system and manually install the
boot loader to fix things up.

I've also had problems with distro boot systems not working with labels
because there were too many drives in the system and it gave up before
checking far enough to find the root partition (on that machine the root
partition was sdr2)

>> Why do you allow uni-processor kernel builds then?
>
> Funny you should suggest that...
> I don't think OpenSuSE10.3 includes any UP kernels. There is code in
> the kernel which detects the single processor case and removes some
> the more expense "LOCK" operations to reduce the cost of using an SMP
> kernel on a UP computer.
> There is real value in reducing the number of options, and people have
> obviously put work into making that a cost-effective proposition.

but there's a huge difference between a distro deciding to not include UP
kernels and removing the option to build a UP kernel from the kernel
entirely. Nobody is saying that Ubuntu (or any other distro) should be
prohibited from making everything SMP, or i686, we are just saying that
the option to compile something UP or i486 should not be removed just
because distros don't choose to use them much. (has the i386 option been
completely eradicated yet? or is it still hanging on?)

David Lang

2007-10-16 02:47:53

by David Lang

Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007, Theodore Tso wrote:

> On Mon, Oct 15, 2007 at 03:04:00AM -0500, Rob Landley wrote:
>
>>> just
>>> as Ethernet and PPP interfaces really are fundamentally the same
>>> thing.
>>
>> They're the same thing?
>>
>> Do you mean that on a system with both, going:
>> ifconfig eth1 66.92.53.140
>> ifconfig ppp 192.168.0.42
>>
>> Would be functionally equivalent to:
>> ifconfig eth1 192.168.0.42
>> ifconfig ppp 66.92.53.140
>
> No, of course not. But we don't have separate IP stacks for ethernet
> and ppp devices. And how we connect to a host via ssh makes no
> difference whether we accessed it via Ethernet or PPP. And I would
> argue that how we address a filesystem should also make no difference
> depending on the path to hard drive.

I think a close analogy would be that after a partition is mounted you
don't need to know the path to the hard drive, and that is already true
today. when you mount a drive (or assign an IP address to a network
interface) the path to the device not only matters, it's critical.

>> By the way, ethernet cards contain a unique MAC address. Hard
>> drives do not seem to, or if they do it's not being consistently
>> exposed in a way I can find.
>
> You can pull a Model and Serial number via hdparm -i, but it's not as
> easy to manipulate as a fixed-length MAC address. That's why people
> tend to use filesystem UUID's.
>
>>> More to the point, with SATA, hot plugging has been designed in, so
>>> probing order is not going to be well defined,
>>
>> The spec may define the capability to hotplug, but your average
>> laptop doesn't not offer the capability to hotplug anything into its
>> SATA controllers. The hard drive is screwed in (due to the
>> portability part of laptopness), all the controllers wired onto the
>> motherboard are accounted for, none are exposed externally. What
>> _is_ exposed externally is USB, and if you want to add an extra hard
>> drive you can buy a cheap USB one at Fry's.
>
> That may be true for laptops today, but Linux doesn't run just on
> servers. You can easily get home servers with hot-swap SATA bays. My
> home fileserver, which is a white box I purchased on my own nickel,
> NOT IBM big iron, has 3TB of raw storage for less than $10,000 a year
> ago. Today, that amount of home storage with hot-swap SATA drives and
> a battery-backed hardware RAID controller could probably be purchased
> for about half that price.

I also have a 3TB raid I built at home, it uses 3ware cards and a dozen
300G IDE drives. since the 3ware driver is classified as SCSI if a drive
fails all the other drives get renumbered on the next boot and it's
painful to figure out which drive has a problem. I have to reboot and go
into the 3ware BIOS to figure out which drive isn't reporting. This system
also has an adaptec raid card in it and an adaptec regular SCSI card. The
fact that these three cards take different drivers, and so the order of
detection changes the drive numbering is a real pain when I'm installing a
new distro onto it. once I get it installed I compile my own monolithic
kernel and this problem stops because the kernel linking order determines
the detection order.

this replaced a 1.2TB raid that I just about filled up, and then started
having drive failures due to age on. It used 8 160G IDE drives, and when I
had problems with a drive it was easy to see that /dev/hdk was missing
from the set, and I was still able to have a removable drive bay for
/dev/hdc that I could hook my tivo drive into (on a reboot for safety) and
not have things go haywire if I left the bay empty (or switched off) when
I booted.

this may not be hundreds of drives, but it should be enough to show that I
have experienced the pain that some people claim is the reason all of this
must be dynamic with a userspace helper to sort it all out. My take is
that adding the userspace helper and not enumerating things that are easy
to enumerate is making things worse, not better.

> And even for laptops, if you need the performance, you can get Cardbus
> cards that will allow you to connect eSATA drives to your laptop at
> Fry's.
>
> So even if you ignore "big data center" interconnects like FC, this
> problem exists even for commodity grade SATA devices.

but these are separate SATA buses. while you could run into ordering
issues if you hook multiple devices to one bus, you should be able to have
no ordering issues if you don't have more than one device of a type on any
one bus (you could have a SATA hard drive on the internal PCI controller,
and another one on the Cardbus controller, but if you always order
directly connected devices before cardbus connected devices they will
always show up in the same order)
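
as a sketch of that ordering rule (Python, with invented names): sort by a
fixed priority per bus class first and by the address on that bus second.
the bus classes and their priorities are assumptions made for the example,
not an existing kernel policy.

```python
# Lower number = enumerated earlier. Illustrative values only.
BUS_PRIORITY = {"onboard": 0, "cardbus": 1, "usb": 2}


def enumerate_disks(disks):
    """disks: iterable of (bus_class, bus_address) tuples, in any order.
    Returns them sorted by bus class first, then by address on that bus."""
    return sorted(disks, key=lambda d: (BUS_PRIORITY[d[0]], d[1]))


# The same two drives, detected in a different order on two boots.
boot1 = [("cardbus", "0:0"), ("onboard", "0:0")]
boot2 = [("onboard", "0:0"), ("cardbus", "0:0")]
print(enumerate_disks(boot1) == enumerate_disks(boot2))
```

the result no longer depends on which driver happened to finish probing
first, only on where each drive is attached.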

>> It's necessary for IBM big iron to do this. It's generally not
>> necessary for laptops or embedded systems to do this if they
>> distinguish between _types_ of devices, which is something they
>> until recently did for the types of devices I was interested in, and
>> something they _stopped_ doing when everything got merged into the
>> scsi layer, and I consider this a regression.
>
> As another example, it's easy to see a home media server running Linux
> which doesn't have any expansion bays for additional hard drive --- so
> the only way a user could expand their storage is by using one or more
> permanently connected USB disks. So we do need to solve the general
> device enumeration problem in the general case; it's not just the case
> of IBM "big iron" as you seem to think.

there are two separate problems here.

1. how to enumerate devices that have a repeatable, stable address.

2. how to enumerate devices that do not.

nobody is saying that there are no cases of #2 and that there is no need
to address that problem. what I, and I think others, are saying is that the
solutions to #2 are not perfect, and while they are a reasonable fit for
that case, they are in many ways inferior to simple enumeration for
devices in category #1.

>> No, distinguishing between types of devices is not a perfect
>> solution to device enumeration, but it was sufficient for all my use
>> cases for many years, and would still be if the kernel still did it,
>> and I'm not alone here.
>
> News flash! The kernel wasn't built just for you, and over time, more
> and more people will have multiple disk drives of the same type, so we
> will need to solve the device naming problem sooner or later. Why not
> solve it sooner, especially given that a number of companies (not just
> IBM) are funding the organization that is paying *your* salary are
> interested in solving the general case?

the kernel wasn't just built for people who have dozens or hundreds of
devices on busses that make enumeration impossible either, why should
their requirements be the only ones considered?

(by the way, I think the crack about who is paying Rob's salary is a
little below the belt)

> Furthermore, I've already pointed a number of situations where the
> home user might have multiple USB devices on their system today, and
> that is probably going to go up over time, not down. Have you seen
> how cheap 500GB USB disks are at Costco? And for a typical
> unsophisticated user, plugging in another 500G USB disk when they need
> more storage is a lot easier than cracking open the computer case and
> futzing with screws and disk cables and power connectors.

so let USB devices use 'best guess' naming and let other devices use
names based on their fixed addresses/hardware paths.

you could use the suggestion made by Stefan Richter in Message-ID:
<[email protected]> that lets the driver suggest a name
if the system hasn't chosen to override it. Since distros look for
/dev/sd* it should even be able to work without breaking new installs (the
transition would break existing installs, so it would need to be optional)

David Lang

2007-10-16 02:51:15

by David Lang

Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007, Greg KH wrote:

> On Mon, Oct 15, 2007 at 05:08:36AM -0500, Rob Landley wrote:
>> On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:
>>> On 10/15/07, Rob Landley <[email protected]> wrote:
>>>> I note that the eth0 and eth1 names are dynamically assigned on a first
>>>> come first serve basis (like scsi). This never causes me a problem
>>>> because the driver loading order is constant, and once you figure out
>>>> that eth0 is gigabit and eth1 is the 80211g it _stays_ that way across
>>>> reboots, reliably. Yeah, it's a heuristic. Hands up everybody relying on
>>>> such a heuristic in the real world.
>>>
>>> Umm, not quite, from my experiences with pre-production wireless
>>> drivers, (another story, another time) fancy stuff is being done in
>>> udev to make sure that your gigabit card is always assigned to eth0.
>>
>> I remember building a 2.4 kernel, statically linking in all the drivers, and
>> getting the ethernet devices showing up in a reliable order for years. Where
>> does the need for fancy stuff come in?
>
> Because PCI devices reorder their bus numbers all the time. And we have
> ethernet devices hanging off of USB connections now (yes, even built-in
> to the machine), and we have network connections on other hot-pluggable
> busses (remember, PCI is hot pluggable.)

do PCI devices reorder their bus numbers spontaneously, or only if you
change the hardware?

> So, the distros need to name network devices in a persistant way, that
> is why the distros now do this. If you don't like the distro doing it,
> complain to them, it's not a kernel issue :)

I have, at least the response was to tell me how to kill this 'feature'
even if they won't change it.

David Lang

2007-10-16 03:03:27

by David Lang

Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007, Stefan Richter wrote:

> Subject: Re: What still uses the block layer?
>
> Matthew Wilcox wrote:
>> On Mon, Oct 15, 2007 at 04:26:04AM -0500, Rob Landley wrote:
>>> Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
>>> IDE devices much harder than in the traditional "/dev/hdb doesn't move
>>> without a screwdriver" model. The merger creates a new problem for IDE, one
>>> which didn't exist before: the addition or removal of other unrelated types
>>> of devices may change this device's location next boot. It may be possible
>>> to add additional complication to the system to compensate, but what was the
>>> advantage of merging the namespaces in the first place?
>>
>> It's not something anyone particularly set out to do, it's just how
>> it worked out. It was justified by saying "ok, this goes from a 99%
>> solution to a 96% solution, but there's 100% solution called uuids".
>> I don't particularly agree with this line of argumentation, but it did
>> hold sway.
>
> Low-level networking drivers suggest a default interface name (per
> interface or as a template like eth%d into which the networking core
> inserts a lowest spare number). Userspace can rename interfaces, but
> nevertheless it's nice to have different default kernel names for
> ethernet, wlan etc..
>
> Could low-level SCSI drivers provide similar name templates which give a
> hint on the transport involved? It's a bit more difficult as with
> networking interfaces though because
> - SCSI devices can have sd, sr, st, osst, ch, sg interfaces,
> - SCSI device files share a namespace with all other device files.
>
> E.g.
> /dev/sd-ide-b - second IDE HDD,
> /dev/sd-iscsi-e - fifth iSCSI direct access device,
> /dev/sr-sata-0 - first SATA CD-ROM,
> /dev/sr-usb-0 - a USB CD-ROM,
> /dev/st-fw-0 - a FireWire tape drive,
> /dev/sda - a device whose transport driver didn't propose a name
>
> Of course the really interesting names will still be provided by
> udev-generated symlinks.

this is a nice option, and since most of the existing userspace code is
looking for /dev/sd*, /dev/sr*, etc this should be able to work for new
installs with no userspace changes. Since it would break existing installs
it would need to be optional.

one other option that could be considered (and I do realize I'm bringing
up flame-bait here) is that drivers that have fixed addresses could offer
up a device name that includes that address.
i.e. depending on the config option a device could show up as either sda,
sd-scsi-a, sd-scsi-0:0:0:0, or even sd-scsi-<WWN>

if the driver or bus doesn't have a real numbering, it wouldn't invent a
fake one (which is a big problem with most of the prior suggestions that
have tried to offer a numbering option), it would just offer the most
specific information it has.
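
a minimal Python sketch of "offer the most specific information it has".
the function name propose_name and its parameters are hypothetical, and for
brevity it uses letters for every interface type and stops at 26 devices;
it is only meant to show the fallback order, not a real driver interface.

```python
import string


def propose_name(interface, index, transport=None, address=None):
    """Build a default device name from the most specific information the
    driver has. `interface` is e.g. "sd" or "sr"; `index` counts devices
    within the chosen namespace (0 -> 'a', limited to 26 in this sketch)."""
    if transport is None:
        # Transport driver proposed nothing: plain legacy-style name.
        return interface + string.ascii_lowercase[index]
    if address is not None:
        # A real, fixed address exists: use it rather than a counter.
        return "%s-%s-%s" % (interface, transport, address)
    # Transport known but no fixed address: per-transport letter, no fake number.
    return "%s-%s-%s" % (interface, transport, string.ascii_lowercase[index])


print(propose_name("sd", 0))
print(propose_name("sd", 1, transport="ide"))
print(propose_name("sd", 0, transport="scsi", address="0:0:0:0"))
```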

David Lang

2007-10-16 03:56:20

by Eric W. Biederman

Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Nick Piggin <[email protected]> writes:

> On Monday 15 October 2007 18:04, Rob Landley wrote:
>> On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:
>
>> > > excuse for conflating different categories of devices in the first
>> > > place.
>> >
>> > See the thinkpad Ultrabay drive example above.
>>
>> Last week I drove my laptop so deep into swap (with a "make -j" on qemu)
>> that after half an hour trying to repaint my kmail window, it locked solid.
>> Again. You'd think the oom killer would come to the rescue, but it didn't.
>> Maybe Ubuntu disabled it. I have _2_gigs_ of ram in this sucker, on a
>> stock Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times),
>> and yet I managed to make it swap itself to death one more time.
>>
>> Virtual memory isn't perfect. I've _always_ been able to come up with
>> examples where it just doesn't work for me. This doesn't mean VM
>> overcommit should be abolished, because it's useful more often than not.
>
> I hate to go completely offtopic here, but disks are so incredibly
> slow when compared to RAM that there is really nothing the kernel
> can do about this. Presumably the job will finish, given infinite
> time.
>
> How much swap do you have configured? You really shouldn't configure
> so much unless you do want the kernel to actually use it all, right?

No.

There are three basic swapping scenarios.
- Pushing unused data out of ram
- Swapping
- Thrashing

To effectively swap you need SWAP > RAM because after a little while of
swapping all of your pages in RAM should be assigned a location in the
page cache.

I have not heard of many people swapping and not thrashing lately.
I think part of the problem is that we do random access to the swap
partition which makes us seek limited. And since the number of
seeks per unit time has been increasing at a linear or slower rate,
if we are doing random disk I/O then the amount we can use
the disk for is very limited. I wonder if we could figure out
how to push and pull 1M or bigger chunks into and out of swap?
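
A back-of-the-envelope model of the seek-limited argument, in Python. The
figures (10 ms per seek, 50 MiB/s sequential transfer) are assumptions
picked for illustration and are not measurements from this thread; the
model simply charges one seek plus the transfer time for every chunk.

```python
def swap_throughput(chunk_bytes, seek_seconds, bandwidth_bytes_per_s):
    """Bytes per second when every chunk costs one seek plus its transfer."""
    time_per_chunk = seek_seconds + chunk_bytes / bandwidth_bytes_per_s
    return chunk_bytes / time_per_chunk


KIB, MIB = 1024, 1024 * 1024
SEEK = 0.010          # assumed 10 ms average seek plus rotational delay
BANDWIDTH = 50 * MIB  # assumed 50 MiB/s sequential transfer rate

for chunk in (4 * KIB, 1 * MIB):
    rate = swap_throughput(chunk, SEEK, BANDWIDTH)
    print("%8d-byte chunks: %.2f MiB/s" % (chunk, rate / MIB))
```

Under these assumptions, random 4 KiB page I/O stays well below 1 MiB/s,
while 1 MiB chunks reach tens of MiB/s, which is the motivation for moving
larger chunks in and out of swap.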

I don't know if swap has actually worked since vmscan stopped
going over the virtual addresses.

> Because if we're not really conservative about OOM killing, then the
> user who actually really did want to use all the swap they configured
> gets angry when we kill their jobs without using it all.

I totally agree. The fact that the OOM killer started is a sign that
the system was completely overwhelmed and nothing better could happen.

In this case my gut feel says limiting the total number of processes
would have been much more effective than anything at all to do with
swap. make -j reminds me of the classic fork bomb.

> Would an oom-kill-someone-now sysrq be of help, I wonder?

Well we have SAK which should kill everything on your current VT
which should include X and all of its children.

Eric

2007-10-16 04:04:20

by Matthew Wilcox

Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 07:54:22PM -0700, [email protected] wrote:
> do PCI devices reorder their bus numbers spontaniously, or only if you
> change the hardware?

The only system I've had that reordered PCI bus numbers was when I had a
partitionable system and changed the partitioning. Not quite "change
the hardware", but neither was it "spontaneous". It was certainly
unexpected (for me).

Greg probably has quite different examples.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-10-16 04:06:30

by David Lang

Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Mon, 15 Oct 2007, Eric W. Biederman wrote:

> Nick Piggin <[email protected]> writes:
>
>> How much swap do you have configured? You really shouldn't configure
>> so much unless you do want the kernel to actually use it all, right?
>
> No.
>
> There are three basic swapping scenarios.
> - Pushing unused data out of ram
> - Swapping
> - Thrashing
>
> To effectively swap you need SWAP > RAM because after a little while of
> swapping all of your pages in RAM should be assigned a location in the
> page cache.

on some kernel versions you are correct about needing swap > ram, but on
current versions you are not. the swap space gets allocated as needed, and
re-used as needed (I don't know the mechanism of this, but I remember the
last time this changed from vm=max(ram,swap) to vm=ram+swap)

> I have not heard of many people swapping and not thrashing lately.
> I think part of the problem is that we do random access to the swap
> partition which makes us seek limited. And since the number of
> seeks per unit time has been increasing at a linear or slower rate
> that if we are doing random disk I/O then the amount we can use
> the disk for is very limited. I wonder if we could figure out
> how to push and pull 1M or bigger chunks into and out of swap?

it has been noted by many people that linux is very slow to pull things
back into ram from swap, significantly slower than simple seek limiting
would seem to account for.

David Lang

2007-10-16 04:12:01

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007, Matthew Wilcox wrote:

> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [email protected] wrote:
>> do PCI devices reorder their bus numbers spontaneously, or only if you
>> change the hardware?
>
> The only system I've had that reordered PCI bus numbers was when I had a
> partitionable system and changed the partitioning. Not quite "change
> the hardware", but neither was it "spontaneous". It was certainly
> unexpected (for me).

Ok, I would class that as the equivalent of 'changing the hardware'.

> Greg probably has quite different examples.

I would definitely be interested in hearing some of them. Greg's comment
makes it sound like this is something that (with modern hardware) could
happen to anyone at any time (which, if true, would be sufficient to
require 'best effort' naming of devices for everything), while my
experience is that if the hardware is static (i.e. you don't plug in or
unplug PCI devices) the numbering of existing PCI devices and buses is
static. And while I understand that consumer distros want to have
everything 'best effort' named to make it easier for users, I disagree
that this should force everyone to use 'best effort' when there are many
situations where it's unnecessary overhead and chances for errors.

David Lang

2007-10-16 04:13:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007 22:04:01 -0600
Matthew Wilcox <[email protected]> wrote:

> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [email protected] wrote:
> > do PCI devices reorder their bus numbers spontaneously, or only if
> > you change the hardware?
>
> The only system I've had that reordered PCI bus numbers was when I
> had a partitionable system and changed the partitioning. Not quite
> "change the hardware", but neither was it "spontaneous". It was
> certainly unexpected (for me).
>

a very common one is booting your laptop docked (a real dock, not just
a port extender) versus non-docked....

2007-10-16 04:18:15

by Nick Piggin

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:

> > How much swap do you have configured? You really shouldn't configure
> > so much unless you do want the kernel to actually use it all, right?
>
> No.
>
> There are three basic swapping scenarios.
> - Pushing unused data out of ram
> - Swapping
> - Thrashing
>
> To effectively swap you need SWAP > RAM because after a little while of
> swapping all of your pages in RAM should be assigned a location in the
> page cache.

I don't follow your logic. We don't need SWAP > RAM in order to swap
effectively, IMO.


> I have not heard of many people swapping and not thrashing lately.
> I think part of the problem is that we do random access to the swap
> partition which makes us seek limited. And since the number of
> seeks per unit time has been increasing at a linear or slower rate
> that if we are doing random disk I/O then the amount we can use

I don't know if there is a causal relationship there. I mean, I
think it's been a long time since thrashing was ever a viable mode
of operation, right?

Maybe desktops just have less need for swapping now, so nobody sees
it much until something goes _really_ bad. When I'm using my 256MB
machine, unused stuff goes to swap.


> the disk for is very limited. I wonder if we could figure out
> how to push and pull 1M or bigger chunks into and out of swap?

Pulling in 1MB pages can really easily end up compounding the
thrashing problem unless you're very sure a significant amount
of it will be used.


> I don't know if swap has actually worked since we vmscan stopped
> going over the virtual addresses.

I do, and it does ;)


> > Because if we're not really conservative about OOM killing, then the
> > user who actually really did want to use all the swap they configured
> > gets angry when we kill their jobs without using it all.
>
> I totally agree. The fact that the OOM killer started is a sign that
> the system was completely overwhelmed and nothing better could happen.
>
> In this case my gut feel says limiting the total number of processes
> would have been much more effective than anything at all to do with
> swap. make -j reminds me of the classic fork bomb.

Yep.


> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> Well, we have SAK, which should kill everything on your current VT,
> which should include X and all of its children.

Which is exactly what you don't want to do if you've just forkbombed
yourself. I missed the fact that we now have a manual oom kill...

2007-10-16 04:21:27

by Greg KH

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, Oct 15, 2007 at 10:04:01PM -0600, Matthew Wilcox wrote:
> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [email protected] wrote:
> > do PCI devices reorder their bus numbers spontaneously, or only if you
> > change the hardware?
>
> The only system I've had that reordered PCI bus numbers was when I had a
> partitionable system and changed the partitioning. Not quite "change
> the hardware", but neither was it "spontaneous". It was certainly
> unexpected (for me).
>
> Greg probably has quite different examples.

Changing the hardware (adding a new PCI device or removing one) is the
most common time this happens. But I have seen reports of this
happening when you upgrade/downgrade BIOS versions, and, in some
oops-we-messed-up cases, when we changed things in the kernel.

thanks,

greg k-h

2007-10-16 04:39:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Nick Piggin <[email protected]> writes:

> On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:
>> Nick Piggin <[email protected]> writes:
>
>> > How much swap do you have configured? You really shouldn't configure
>> > so much unless you do want the kernel to actually use it all, right?
>>
>> No.
>>
>> There are three basic swapping scenarios.
>> - Pushing unused data out of ram
>> - Swapping
>> - Thrashing
>>
>> To effectively swap you need SWAP > RAM because after a little while of
>> swapping all of your pages in RAM should be assigned a location in the
>> page cache.
>
> I don't follow your logic. We don't need SWAP > RAM in order to swap
> effectively, IMO.

The steady state of a system that is heavily and usably swapping but
not thrashing is that all of the pages in RAM are in the swap cache,
at least that used to be the case.

>> I have not heard of many people swapping and not thrashing lately.
>> I think part of the problem is that we do random access to the swap
>> partition which makes us seek limited. And since the number of
>> seeks per unit time has been increasing at a linear or slower rate
>> that if we are doing random disk I/O then the amount we can use
>
> I don't know if there is a causal relationship there. I mean, I
> think it's been a long time since thrashing was ever a viable mode
> of operation, right?

Right. But swapping heavily has been a viable mode of operation,
and the vast gap in disk random I/O performance seems to have
hurt it significantly.

To be very clear, it used to be possible to run a problem at a little
below full speed with the disk pegged with swap traffic, and I did this
regularly when I started out with linux.

> Maybe desktops just have less need for swapping now, so nobody sees
> it much until something goes _really_ bad. When I'm using my 256MB
> machine, unused stuff goes to swap.

There is a bit of truth in the fact that there is less need for
swapping now. At the same time however swapping simply does not
work well right now, and I'm not at all certain why.

>> the disk for is very limited. I wonder if we could figure out
>> how to push and pull 1M or bigger chunks into and out of swap?
>
> Pulling in 1MB pages can really easily end up compounding the
> thrashing problem unless you're very sure a significant amount
> of it will be used.

It's a hard call. The I/O time for 1MB of contiguous disk data
is about the I/O time of 512 bytes of contiguous disk data.
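
A rough way to see why this is a hard call is a seek-plus-transfer model. The sketch below is only illustrative: the 10 ms seek and 40 MB/s streaming rate are assumed figures, not measurements from this thread, and `io_ms` is a hypothetical helper.

```python
def io_ms(nbytes, seek_ms=10.0, mb_per_s=40.0):
    """Estimated time in milliseconds for one contiguous read:
    one seek plus the streaming transfer time (assumed figures)."""
    transfer_ms = nbytes / (mb_per_s * 1_000_000) * 1000.0
    return seek_ms + transfer_ms


small = io_ms(512)        # dominated almost entirely by the seek
large = io_ms(2**20)      # seek plus about 26 ms of transfer
data_ratio = 2**20 // 512
time_ratio = large / small

if __name__ == "__main__":
    print(f"512 B: {small:.2f} ms, 1 MiB: {large:.2f} ms")
    print(f"{data_ratio}x the data for about {time_ratio:.1f}x the time")
```

Under these assumptions the large read moves 2048 times the data for a small single-digit multiple of the time, which is the sense in which the two costs are comparable.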

>> I don't know if swap has actually worked since we vmscan stopped
>> going over the virtual addresses.
>
> I do, and it does ;)

Really? Not just the pushing of unused stuff into swap?


>> > Because if we're not really conservative about OOM killing, then the
>> > user who actually really did want to use all the swap they configured
>> > gets angry when we kill their jobs without using it all.
>>
>> I totally agree. The fact that the OOM killer started is a sign that
>> the system was completely overwhelmed and nothing better could happen.
>>
>> In this case my gut feel says limiting the total number of processes
>> would have been much more effective than anything at all to do with
>> swap. make -j reminds me of the classic fork bomb.
>
> Yep.
>
>
>> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>>
>> Well, we have SAK, which should kill everything on your current VT,
>> which should include X and all of its children.
>
> Which is exactly what you don't want to do if you've just forkbombed
> yourself. I missed the fact that we now have a manual oom kill...

You probably have a point there.

Eric

2007-10-16 04:46:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

[email protected] writes:

>
> on some kernel versions you are correct about needing swap > ram, but on current
> versions you are not. the swap space gets allocated as needed, and re-used as
> needed (I don't know the mechanism of this, but I remember the last time this
> changed from vm=max(ram,swap) to vm=ram+swap)

I don't think I can recall a linux kernel that required swap > ram.
However for serious swapping under linux having swap > ram was very
useful and pretty much a requirement for a workload that involved
swapping heavily (not thrashing).

>> I have not heard of many people swapping and not thrashing lately.
>> I think part of the problem is that we do random access to the swap
>> partition which makes us seek limited. And since the number of
>> seeks per unit time has been increasing at a linear or slower rate
>> that if we are doing random disk I/O then the amount we can use
>> the disk for is very limited. I wonder if we could figure out
>> how to push and pull 1M or bigger chunks into and out of swap?
>
> it has been noted by many people that linux is very slow to pull things back
> into ram from swap, significantly slower than simple seek limiting would seem to
> account for.

Yes. It may be the large amount of random access (my current guess)
or it may be something else.

I wonder if I should build an application with a configurable
data set and working set that can be used for swap testing. I don't
think it would be very hard and it might help sort through some of
the swap performance problems.
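
A minimal sketch of what such a test program might look like, assuming a 4 KiB page size and a uniformly random access pattern inside the working set. The name `touch_pages` and all sizes are hypothetical; a real test would size the data set relative to RAM so that swapping actually occurs.

```python
import random
import time

PAGE_SIZE = 4096  # assumed; os.sysconf("SC_PAGE_SIZE") reports the real value


def touch_pages(data_set_bytes, working_set_bytes, touches, seed=0):
    """Allocate data_set_bytes, then write to random pages within the first
    working_set_bytes. Returns (elapsed_seconds, distinct_pages_touched)."""
    if not PAGE_SIZE <= working_set_bytes <= data_set_bytes:
        raise ValueError("need PAGE_SIZE <= working set <= data set")
    buf = bytearray(data_set_bytes)
    # Write one byte per page so the whole data set is really backed by memory.
    for offset in range(0, data_set_bytes, PAGE_SIZE):
        buf[offset] = 1
    pages = working_set_bytes // PAGE_SIZE
    rng = random.Random(seed)
    seen = set()
    start = time.perf_counter()
    for _ in range(touches):
        page = rng.randrange(pages)
        offset = page * PAGE_SIZE
        buf[offset] = (buf[offset] + 1) % 256   # dirty the page
        seen.add(page)
    return time.perf_counter() - start, len(seen)


if __name__ == "__main__":
    elapsed, distinct = touch_pages(64 * 2**20, 16 * 2**20, 100_000)
    print(f"{distinct} distinct pages touched in {elapsed:.3f}s")
```

Growing the data set past RAM while holding the working set fixed, and then growing the working set, would separate the "pushing unused data out" case from heavy swapping and from thrashing.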

Eric



2007-10-16 04:52:47

by Nick Piggin

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Tuesday 16 October 2007 14:38, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:

> > I don't follow your logic. We don't need SWAP > RAM in order to swap
> > effectively, IMO.
>
> The steady state of a system that is heavily and usably swapping but
> not thrashing is that all of the pages in RAM are in the swap cache,
> at least that used to be the case.

Yeah, it works better in 2.6 (and, IIRC later 2.4 kernels).


> > I don't know if there is a causal relationship there. I mean, I
> > think it's been a long time since thrashing was ever a viable mode
> > of operation, right?
>
> Right. But swapping heavily has been a viable mode of operation
> and that the vast gap in disk random IO performance seems to have
> hurt significantly.

Or, just not improved as fast as everything else is improving.
There isn't too much the kernel can do about that. It just
relatively changes the point at which you'd consider "swapping
heavily", right?


> To be very clear, it used to be possible to run a problem at a little below
> full speed with the disk pegged with swap traffic, and I did this
> regularly when I started out with linux.

I can do this now. In make -jhuge tests for example, you can get
a 4GB, 4 core machine to max out a disk with swapping and still
have 0 idle time. Of course you can also go past that point and
your idle time comes up. That's not new though.


> > Maybe desktops just have less need for swapping now, so nobody sees
> > it much until something goes _really_ bad. When I'm using my 256MB
> > machine, unused stuff goes to swap.
>
> There is a bit of truth in the fact that there is less need for
> swapping now. At the same time however swapping simply does not
> work well right now, and I'm not at all certain why.
>
> >> the disk for is very limited. I wonder if we could figure out
> >> how to push and pull 1M or bigger chunks into and out of swap?
> >
> > Pulling in 1MB pages can really easily end up compounding the
> > thrashing problem unless you're very sure a significant amount
> > of it will be used.
>
> It's a hard call. The I/O time for 1MB of contiguous disk data
> is about the I/O time of 512 bytes of contiguous disk data.

And if you're thrashing, then by definition you need to throw
out 1MB of your working set in order to read it in.


> >> I don't know if swap has actually worked since we vmscan stopped
> >> going over the virtual addresses.
> >
> > I do, and it does ;)
>
> Really? Not just the pushing of unused stuff into swap.

We had several bugs and things that caused swapping performance
regressions vs 2.4 in earlyish 2.6. After those were fixed, we're
pretty competitive with 2.4 in some basic tests I was using. I
haven't run them for a fair while, so something might have broken
since then, I don't know.

2007-10-16 04:57:15

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007, Greg KH wrote:

> On Mon, Oct 15, 2007 at 10:04:01PM -0600, Matthew Wilcox wrote:
>> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [email protected] wrote:
>>> do PCI devices reorder their bus numbers spontaneously, or only if you
>>> change the hardware?
>>
>> The only system I've had that reordered PCI bus numbers was when I had a
>> partitionable system and changed the partitioning. Not quite "change
>> the hardware", but neither was it "spontaneous". It was certainly
>> unexpected (for me).
>>
>> Greg probably has quite different examples.
>
> Changing the hardware (adding a new PCI device or removing one) are the
> most common times this happens. But I have seen reports of this
> happening when you upgrade/downgrade BIOS versions, and, in some
> oops-we-messed-up cases, when we changed things in the kernel.

BIOS upgrades qualify as changing hardware (or close to it)

oops-we-messed-up cases of kernel changes don't justify 'best effort'
naming, it's a regression that needs to be fixed.

now the other example given of docking a laptop is closer to reasonable
(and is definitely a reason to have 'best effort' naming as an option),
but that's still a relatively special case, and it _is_ definitely
changing the hardware

David Lang

2007-10-16 05:57:54

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

[email protected] wrote:
> On Mon, 15 Oct 2007, Stefan Richter wrote:
>> Low-level networking drivers suggest a default interface name (per
>> interface or as a template like eth%d into which the networking core
>> inserts a lowest spare number).
...
>> Could low-level SCSI drivers provide similar name templates which give a
>> hint on the transport involved?
...
> one other option that could be considered (and I do realize I'm bringing
> up flame-bait here) is that drivers that have fixed addresses could
> offer up a device name that include that address.
...

That's already implemented. :-) Transport drivers expose transport
specific information in sysfs; udev scripts examine it and create by-id
and by-path symlinks to device files of HDDs. Not everybody agrees, but
many think that it's sensible to implement just mechanism in kernel and
leave policy to userspace. My suggestion and the default network
interface names already violate this principle to a degree, but it can
still be implemented in a transport independent way, and userspace can
continue to create whatever names the user needs.
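
As an illustration of the userspace side, here is a small sketch that lists the persistent symlinks udev creates and the device nodes they point to. It assumes the usual `/dev/disk/by-id` location (with `/dev/disk/by-path` as the sibling directory); `stable_names` is a hypothetical helper, not part of udev.

```python
import os


def stable_names(directory="/dev/disk/by-id"):
    """Map each symlink name in `directory` to the device node it resolves to."""
    mapping = {}
    if not os.path.isdir(directory):
        return mapping  # directory not present on this system
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping


if __name__ == "__main__":
    for name, target in stable_names().items():
        print(f"{name} -> {target}")
```

Passing `"/dev/disk/by-path"` instead shows names derived from the bus topology rather than from the device's own identifiers.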
--
Stefan Richter
-=====-=-=== =-=- =----
http://arcgraph.de/sr/

2007-10-16 06:22:21

by David Newall

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Nick Piggin wrote:
> On Monday 15 October 2007 19:52, Rob Landley wrote:
>
>> On Monday 15 October 2007 8:37:44 am Nick Piggin wrote:
>>
>>> You really shouldn't configure
>>> so much [swap] unless you do want the kernel to actually use it all, right?
>>>
>> Two words: "Software suspend". I've actually been thinking of increasing
>> it on the next install...
>>
>
> Kernel doesn't know that you want to use it for suspend but not
> regular swapping, unfortunately.
>

Couldn't you mount swap before suspend and unmount it after resume?

2007-10-16 06:34:44

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

Jeff Garzik wrote:
> Greg KH wrote:
>> On Mon, Oct 15, 2007 at 03:36:15AM -0500, Rob Landley wrote:
>>> The point I was trying to make is that it seems to me like it would
>>> be possible to keep the namespace separate here, and thus reduce the
>>> enumeration problems to the point where common cases (like my laptop)
>>> aren't impacted by them during early boot.
>>
>> Proposals on how to do this would be gladly reviewed.
>
> Agreed.

- move the networking core's facilities to build the default name of
  an interface into lib/
- expand it to optionally use base-26 numbering (a...nn...zzz) as an
  alternative to decimal numbering
- let SCSI low-level drivers optionally provide a short constant
  string, resembling their transport name, in the host template or
  transport template
- let SCSI high-level drivers make use of the new naming functions in
  lib/, providing either just "sd", "sr" etc. or "sd-$transport-" as
  name prefix

No patch yet, and alas I'm currently short of spare time.
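
To make the naming scheme concrete, here is a userspace sketch of the letter numbering and the prefixed names. This is not kernel code; `alpha_suffix` and `default_name` are hypothetical names, and the "sd-ide-" style follows the examples discussed in this thread.

```python
def alpha_suffix(index):
    """Letter numbering as used for sd names: 0 -> 'a', 25 -> 'z', 26 -> 'aa'."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(chr(ord("a") + rem))
    return "".join(reversed(letters))


def default_name(prefix, index, transport=None):
    """Build 'sdb' when no transport hint is given, 'sd-ide-b' otherwise."""
    if transport is None:
        return prefix + alpha_suffix(index)
    return f"{prefix}-{transport}-{alpha_suffix(index)}"


if __name__ == "__main__":
    print(default_name("sd", 1))          # second disk, single namespace
    print(default_name("sd", 1, "ide"))   # second IDE disk, per-transport namespace
```

Each transport would keep its own counter, so adding a USB key would not shift the index of a hardwired disk.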
--
Stefan Richter
-=====-=-=== =-=- =----
http://arcgraph.de/sr/

2007-10-16 06:39:20

by Rob Landley

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Monday 15 October 2007 11:38:33 pm Eric W. Biederman wrote:
> > I don't follow your logic. We don't need SWAP > RAM in order to swap
> > effectively, IMO.
>
> The steady state of a system that is heavily and usably swapping but
> not thrashing is that all of the pages in RAM are in the swap cache,
> at least that used to be the case.

Mind if I throw in some vague and questionable numbers? :)

I vaguely recall that my old 486 laptop with 16 megabytes of ram (circa 1998)
used to be able to do 3 point something megabytes per second to/from disk,
according to hdparm -t. (That was with DMA enabled.)

This means that my old laptop, using sequential writes and not being bogged
down by excessive seeking, could write its entire memory contents to disk and
read it back in again in about 10 seconds total (5 write, 5 read).

My current laptop has 2 gigabytes of ram, and hdparm -t /dev/sda says:
/dev/sda:
Timing buffered disk reads: 116 MB in 3.01 seconds = 38.54 MB/sec

So that's a little over a factor of 10 speed improvement. (Although I note
that I got 30 megabytes/second off of an ATA/100 adapter in 2002, so it's
barely any faster than it was 5 years ago.)

This means I can expect my current laptop to write out its memory in 50
seconds (2000/40), and another 50 seconds to read it back in.

So 10 seconds to cycle through memory 10 years ago, vs a little under 2
minutes today, on systems at roughly the same price point. And that's
limited by what the hardware is doing, assuming a _perfect_ linear read/write
pattern with no seeks.

Oh, and my old 486 had its RAM maxed out. This one can hold twice as much.
And heavy seeking sucks more than it used to relative to sequential reads by
something like a proportional amount (hence the rise of I/O elevators as a
mitigation strategy), although I haven't got numbers for that handy.
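
The arithmetic above can be sketched as follows. The 3.2 MB/s figure is an assumed stand-in for "3 point something", and 40 MB/s rounds the measured 38.54 MB/s; `cycle_seconds` is a hypothetical helper.

```python
def cycle_seconds(ram_mb, disk_mb_per_s):
    """Return (one_way, round_trip) seconds to stream all of RAM to disk
    and read it back, assuming purely sequential I/O with no seeks."""
    one_way = ram_mb / disk_mb_per_s
    return one_way, 2 * one_way


old_one_way, old_round = cycle_seconds(16, 3.2)     # 1998 laptop, assumed rate
new_one_way, new_round = cycle_seconds(2000, 40)    # 2007 laptop, rounded rate

if __name__ == "__main__":
    print(f"1998 laptop: about {old_round:.0f}s round trip")
    print(f"2007 laptop: about {new_round:.0f}s round trip")
    print(f"roughly {new_round / old_round:.0f}x longer to cycle through memory")
```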

> > I don't know if there is a causal relationship there. I mean, I
> > think it's been a long time since thrashing was ever a viable mode
> > of operation, right?
>
> Right. But swapping heavily has been a viable mode of operation
> and that the vast gap in disk random IO performance seems to have
> hurt significantly.
>
> To be very clear, it used to be possible to run a problem at a little below
> full speed with the disk pegged with swap traffic, and I did this
> regularly when I started out with linux.

The problem is the gap is getting bigger. The 486-75 laptop mentioned above
had a 25 mhz 32 bit front side bus. A quick google suggests my core 2 duo
has a 667 mhz FSB and I'm guessing a 128 bit data path (two 64-bit channels).

I could boot up memtest86 and get actual benchmarks, but total handwaving for
a moment, 25*32=800 and 667*128=85376, and the second divided by the first is
over 100 times as big. That concurs with the 16mhz->1733 mhz processor speed
increase.

Factor of 10 disk speed increase, factor of 100 memory speed increase. Disks
speeds aren't keeping up with processor and memory increases. Disk _sizes_
are, but speeds aren't.
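
The handwaving can be written out explicitly. The bus figures are the guesses given above (including the guessed 128-bit data path), and 3.2 MB/s is an assumed stand-in for the old disk's "3 point something".

```python
def bus_bandwidth(mhz, bits):
    """Crude bandwidth proxy: clock rate times bus width, in MHz * bits."""
    return mhz * bits


old_bus = bus_bandwidth(25, 32)      # 486 laptop
new_bus = bus_bandwidth(667, 128)    # Core 2 Duo, width is a guess
memory_gain = new_bus / old_bus

disk_gain = 38.54 / 3.2              # hdparm figure over assumed old rate

if __name__ == "__main__":
    print(f"memory path: {old_bus} -> {new_bus}, about {memory_gain:.0f}x")
    print(f"disk throughput: about {disk_gain:.0f}x")
```

Halving the assumed bus width would halve the memory gain, but it would still be several times larger than the disk gain.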

> > Maybe desktops just have less need for swapping now, so nobody sees
> > it much until something goes _really_ bad. When I'm using my 256MB
> > machine, unused stuff goes to swap.
>
> There is a bit of truth in the fact that there is less need for
> swapping now. At the same time however swapping simply does not
> work well right now, and I'm not at all certain why.

Do the numbers above help? It'll only get worse, unless some random new
technology (maybe http://en.wikipedia.org/wiki/MRAM or something) swoops in
to change everything, again.

> >> the disk for is very limited. I wonder if we could figure out
> >> how to push and pull 1M or bigger chunks into and out of swap?
> >
> > Pulling in 1MB pages can really easily end up compounding the
> > thrashing problem unless you're very sure a significant amount
> > of it will be used.
>
> It's a hard call. The I/O time for 1MB of contiguous disk data
> is about the I/O time of 512 bytes of contiguous disk data.

Hence "the seek sucking even more now" part. :(

I'm sure somebody will eventually write an OLS paper or something on the
advisability of making swapping decisions with 4k granularity when disks
really want bigger I/O transactions. Maybe they already have, somewhere
between:
http://kernel.org/doc/ols/2007/ols2007v1-pages-53-64.pdf
and
http://kernel.org/doc/ols/2007/ols2007v1-pages-277-284.pdf

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-16 09:33:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Rob Landley <[email protected]> writes:

> On Monday 15 October 2007 11:38:33 pm Eric W. Biederman wrote:
>> > I don't follow your logic. We don't need SWAP > RAM in order to swap
>> > effectively, IMO.
>>
>> The steady state of a system that is heavily and usably swapping but
>> not thrashing is that all of the pages in RAM are in the swap cache,
>> at least that used to be the case.
>
> Mind if I throw in some vague and questionable numbers? :)
>
> I vaguely recall that my old 486 laptop with 16 megabytes of ram (circa 1998)
> used to be able to do 3 point something megabytes per second to/from disk,
> according to hdparm -t. (That was with DMA enabled.)
>
> This means that my old laptop, using sequential writes and not being bogged
> down by excessive seeking, could write its entire memory contents to disk and
> read it back in again in about 10 seconds total (5 write, 5 read).
>
> My current laptop has 2 gigabytes of ram, and hdparm -t /dev/sda says:
> /dev/sda:
> Timing buffered disk reads: 116 MB in 3.01 seconds = 38.54 MB/sec
>
> So that's a little over a factor of 10 speed improvement. (Although I note
> that I got 30 megabytes/second off of an ATA/100 adapter in 2002, so it's
> barely any faster than it was 5 years ago.)
>
> This means I can expect my current laptop to write out its memory in 50
> seconds (2000/40), and another 50 seconds to read it back in.
>
> So 10 seconds to cycle through memory 10 years ago, vs a little under 2
> minutes today, on systems at roughly the same price point. And that's
> limited by what the hardware is doing, assuming a _perfect_ linear read/write
> pattern with no seeks.
>
> Oh, and my old 486 had its RAM maxed out. This one can hold twice as much.
> And heavy seeking sucks more than it used to relative to sequential reads by
> something like a proportional amount (hence the rise of I/O elevators as a
> mitigation strategy), although I haven't got numbers for that handy.
>
>> > I don't know if there is a causal relationship there. I mean, I
>> > think it's been a long time since thrashing was ever a viable mode
>> > of operation, right?
>>
>> Right. But swapping heavily has been a viable mode of operation
>> and that the vast gap in disk random IO performance seems to have
>> hurt significantly.
>>
>> To be very clear, it used to be possible to run a problem at a little below
>> full speed with the disk pegged with swap traffic, and I did this
>> regularly when I started out with linux.
>
> The problem is the gap is getting bigger. The 486-75 laptop mentioned above
> had a 25 mhz 32 bit front side bus. A quick google suggests my core 2 duo
> has a 667 mhz FSB and I'm guessing a 128 bit data path (two 64-bit channels).

I'm pretty certain Intel's architecture only has a 64-bit front side bus.
Of course I'm used to seeing it clocked a bit higher.

> I could boot up memtest86 and get actual benchmarks, but total handwaving for
> a moment, 25*32=800 and 667*128=85376, and the second divided by the first is
> over 100 times as big. That concurs with the 16mhz->1733 mhz processor speed
> increase.
>
> Factor of 10 disk speed increase, factor of 100 memory speed increase. Disks
> speeds aren't keeping up with processor and memory increases. Disk _sizes_
> are, but speeds aren't.

Exactly.

>> > Maybe desktops just have less need for swapping now, so nobody sees
>> > it much until something goes _really_ bad. When I'm using my 256MB
>> > machine, unused stuff goes to swap.
>>
>> There is a bit of truth in the fact that there is less need for
>> swapping now. At the same time however swapping simply does not
>> work well right now, and I'm not at all certain why.
>
> Do the numbers above help? It'll only get worse, unless some random new
> technology (maybe http://en.wikipedia.org/wiki/MRAM or something) swoops in
> to change everything, again.

Well, it will be interesting to see what happens with NAND flash. So far
it is pricey but you can easily make it faster than today's hard drives.
Capacity is still coming.

>> >> the disk for is very limited. I wonder if we could figure out
>> >> how to push and pull 1M or bigger chunks into and out of swap?
>> >
>> > Pulling in 1MB pages can really easily end up compounding the
>> > thrashing problem unless you're very sure a significant amount
>> > of it will be used.
>>
>> It's a hard call. The I/O time for 1MB of contiguous disk data
>> is about the I/O time of 512 bytes of contiguous disk data.
>
> Hence "the seek sucking even more now" part. :(

> I'm sure somebody will eventually write an OLS paper or something on the
> advisability of making swapping decisions with 4k granularity when disks
> really want bigger I/O transactions. Maybe they already have, somewhere
> between:
> http://kernel.org/doc/ols/2007/ols2007v1-pages-53-64.pdf
> and
> http://kernel.org/doc/ols/2007/ols2007v1-pages-277-284.pdf

An interesting point. What would really impress me is actually finding
a current work load that can productively swap after everything kernel
side is fixed up and optimized. So far it seems like real swapping is so
painful that everyone is simply avoiding it.

Eric

2007-10-16 10:18:29

by Alan

[permalink] [raw]
Subject: Re: What still uses the block layer?

> > /dev/sd-ide-b - second IDE HDD,
> > /dev/sr-sata-0 - first SATA CD-ROM,

I wouldn't try dividing those by pata v sata. You'll cause all sorts of
problems in the process because of PATA-SATA and SATA-PATA bridges.

2007-10-16 10:26:30

by Alan

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

> I'm sure somebody will eventually write an OLS paper or something on the
> advisability of making swapping decisions with 4k granularity when disks
> really want bigger I/O transactions.

Funnily enough someone thought of that many years ago. They even added
and documented it, then they made it adjustable.

See the vm section of Documentation/filesystems/proc.txt
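
The tunable in question is presumably page-cluster, which that document describes as a power-of-two count of pages read from swap at once. A small sketch, assuming a 4 KiB page size; `read_page_cluster` and `swap_readahead_bytes` are hypothetical helper names.

```python
def read_page_cluster(path="/proc/sys/vm/page-cluster"):
    """Return the current page-cluster value, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def swap_readahead_bytes(page_cluster, page_size=4096):
    """Bytes read per swap-in attempt: 2**page_cluster pages."""
    return (1 << page_cluster) * page_size


if __name__ == "__main__":
    current = read_page_cluster()
    if current is None:
        print("page-cluster not readable on this system")
    else:
        print(f"page-cluster={current}: {swap_readahead_bytes(current)} bytes per read")
```

With 4 KiB pages, a value of 3 gives 32 KiB per read, and a value of 8 would give the 1 MB chunks suggested earlier in the thread.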

Alan

2007-10-16 19:51:31

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Tue, 16 Oct 2007, Alan Cox wrote:

>>> /dev/sd-ide-b - second IDE HDD,
>>> /dev/sr-sata-0 - first SATA CD-ROM,
>
> I wouldn't try dividing those by pata v sata. You'll cause all sorts of
> problems in the process because of PATA-SATA and SATA-PATA bridges.

if you use a PATA-SATA bridge (IDE drive, SATA controller), it would look
to the system like a SATA drive and be addressed and enumerated as SATA.

if you use a SATA-PATA bridge (SATA drive, PATA controller), it would look
to the system like a PATA drive and be addressed and enumerated as PATA.
prior to libata the device would be /dev/hdX, with X depending on how it's
cabled and if it's set to master or slave. it wouldn't matter if that
device then converts to other things, the system would still know it as an
IDE drive.

this works exactly the same way that external enclosures that hold SATA or
IDE drives, but have SCSI interfaces to the system, have always worked, so
it's what sysadmins will expect.

David Lang

2007-10-16 19:54:45

by Matthew Wilcox

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Tue, Oct 16, 2007 at 12:54:58PM -0700, [email protected] wrote:
> On Tue, 16 Oct 2007, Alan Cox wrote:
> >I wouldn't try dividing those by pata v sata. You'll cause all sorts of
> >problems in the process because of PATA-SATA and SATA-PATA bridges.
>
> if you use a PATA-SATA bridge (IDE drive SATA controller), it would look
> to the system like a SATA drive and be addressed and enumerated as SATA.

But you don't know where the bridge is. It might be on the drive's
board, it might be an explicit enclosure, or it might be on the
motherboard. Each of those scenarios is going to have a different user
expectation.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2007-10-16 20:19:15

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

Matthew Wilcox wrote:
> On Tue, Oct 16, 2007 at 12:54:58PM -0700, [email protected] wrote:
>> On Tue, 16 Oct 2007, Alan Cox wrote:
>>> I wouldn't try dividing those by pata v sata. You'll cause all sorts of
>>> problems in the process because of PATA-SATA and SATA-PATA bridges.
>> if you use a PATA-SATA bridge (IDE drive SATA controller), it would look
>> to the system like a SATA drive and be addressed and enumerated as SATA.
>
> But you don't know where the bridge is. It might be on the drive's
> board, it might be an explicit enclosure, or it might be on the
> motherboard. Each of those scenarios is going to have a different user
> expectation.

If the bridge is on the drive's board or in an enclosure, the user's
expectations are fully met.

If the bridge is on the motherboard, then the user may be surprised
unless he knows the motherboard well enough.

But this is _far_ less of an issue than
- the hda<->sda confusion,
- the confusion caused by all kernel default names put into a single
namespace.

I don't have a personal interest in PATA/SATA distinction though. I
suppose once PATA went into the SCSI namespace and then this namespace
is divided again, it's not a big issue anymore whether PATA and SATA
share an ATA namespace or are distinct, except perhaps for people with
IDE drive and eSATA slots.
--
Stefan Richter
-=====-=-=== =-=- =----
http://arcgraph.de/sr/

2007-10-16 20:35:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Tue, Oct 16, 2007 at 01:54:33PM -0600, Matthew Wilcox wrote:
> On Tue, Oct 16, 2007 at 12:54:58PM -0700, [email protected] wrote:
> > On Tue, 16 Oct 2007, Alan Cox wrote:
> > >I wouldn't try dividing those by pata v sata. You'll cause all sorts of
> > >problems in the process because of PATA-SATA and SATA-PATA bridges.
> >
> > if you use a PATA-SATA bridge (IDE drive SATA controller), it would look
> > to the system like a SATA drive and be addressed and enumerated as SATA.
>
> But you don't know where the bridge is. It might be on the drive's
> board, it might be an explicit enclosure, or it might be on the
> motherboard. Each of those scenarios is going to have a different user
> expectation.

And worse yet, depending on what BIOS options you set at config time,
or what might happen after you upgrade the BIOS, whether the drive
looks like PATA or SATA could change over time. So if you have
/dev/hda hard-coded in your /etc/fstab file, you will probably
lose after you change a BIOS option or take a BIOS
upgrade that causes the BIOS configs to get reset, disabling PATA
emulation, such that your disk that had previously been /dev/hda now
shows up as /dev/sda. (And this is something you will very badly
*want* since your disk drive access will be **much** faster once you
stop using PATA emulation.)

Yet another reason why people who desperately are trying to cling to
the good old days of stable device enumerations are going to be
disappointed; the *type* of the drive can change over time, even for
something as simple as a laptop's primary hard drive, which seems to be
some people's favorite example.

Unfortunately, people are just going to have to suck it up and get
used to a much more complicated world.

- Ted

2007-10-16 20:52:37

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Tue, 16 Oct 2007, Matthew Wilcox wrote:

> On Tue, Oct 16, 2007 at 12:54:58PM -0700, [email protected] wrote:
>> On Tue, 16 Oct 2007, Alan Cox wrote:
>>> I wouldn't try dividing those by pata v sata. You'll cause all sorts of
>>> problems in the process because of PATA-SATA and SATA-PATA bridges.
>>
>> if you use a PATA-SATA bridge (IDE drive SATA controller), it would look
>> to the system like a SATA drive and be addressed and enumerated as SATA.
>
> But you don't know where the bridge is. It might be on the drive's
> board, it might be an explicit enclosure, or it might be on the
> motherboard. Each of those scenarios is going to have a different user
> expectation.

the only one of these that I would find unexpected would be the one on the
motherboard.

why is this any different from the external enclosures? they have always
appeared as the type of device that connects them to the motherboard, (and
even with SCSI, there are some controllers that don't generate sdX
devices)

the driver for the controller is what has historically determined what the
device appears as to the system. an example of this is the 3ware driver
that is a SCSI driver but the drives attached to the card are IDE drives.
another example is the I2O drivers (which give you access to the RAID
array and to the individual drives, in different namespaces). while I may
disagree with some of the selections that have been made (the 3ware has
always seemed odd to me for example) it's pretty simple to figure out.

but in any case, historically IDE (PATA) and SATA drives have been handled
differently. IDE drives have had fixed device names based on how they are
connected; SATA devices have had 'order found' device names from the SCSI
heritage. mixing the two types into one namespace requires changing one or
the other. while I would love to see SATA gain hardware-path-dependent
names I'm not holding my breath, but I hate to lose the predictable
naming (even if the names change) for the IDE drives.
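
As a toy illustration of the two schemes being contrasted here (this is not kernel code; the function names and disk labels are made up for the example, and it only handles the first 26 disks):

```python
import string


def ide_name(channel: int, unit: int) -> str:
    """Legacy IDE naming: the name is fixed by the cable position.

    channel 0 = primary, 1 = secondary; unit 0 = master, 1 = slave.
    """
    return "hd" + string.ascii_lowercase[2 * channel + unit]


def scsi_names(discovery_order: list[str]) -> dict[str, str]:
    """SCSI-style naming: sda, sdb, ... are handed out in the order found."""
    return {
        disk: "sd" + string.ascii_lowercase[index]
        for index, disk in enumerate(discovery_order)
    }


# A drive on the secondary channel, jumpered as master, keeps the same name.
print(ide_name(1, 0))

# The same internal disk is named differently when other devices are found first.
print(scsi_names(["internal-sata"]))
print(scsi_names(["usb-key", "usb-reader", "internal-sata"]))
```

In the first scheme the name only changes when the cabling does; in the second it depends on whatever else happened to be discovered earlier.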

David Lang

2007-10-16 20:57:23

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

Theodore Tso wrote:
> Yet another reason why people who desperately are trying to cling to
> the good old days of stable device enumerations are going to be
> disappointed;

Sure enough; stable device enumeration is a thing of the past.

This doesn't have to stop us, though, from providing descriptive default
names for device files, just like we already provide descriptive default
names for network interfaces. (Not for all, but for many.)
--
Stefan Richter
-=====-=-=== =-=- =----
http://arcgraph.de/sr/

2007-10-16 21:36:37

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Mon, 15 Oct 2007 23:37:44 +1000
Nick Piggin <[email protected]> wrote:

> Would an oom-kill-someone-now sysrq be of help, I wonder?

Is already there: sysrq-f.
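
For machines where the key combination is awkward to type, the same request can be made through /proc/sysrq-trigger. A minimal sketch, assuming a kernel with magic SysRq support and root privileges; the helper name and its `path` parameter are made up for the example:

```python
def trigger_sysrq(command: str = "f", path: str = "/proc/sysrq-trigger") -> None:
    """Write a single SysRq command character to the trigger file.

    'f' asks the kernel to invoke the OOM killer. Writing to the real
    file needs root; `path` is a parameter only so the helper can be
    tried against an ordinary file first.
    """
    if len(command) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(path, "w") as trigger:
        trigger.write(command)


# As root: trigger_sysrq("f")
```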

2007-10-16 21:47:36

by Alan

[permalink] [raw]
Subject: Re: What still uses the block layer?

> but in any case, historicly IDE (PATA) and SATA drives have been handled
> differently, IDE drives have had fixed device names based on how they are
> connected, SATA devices have had 'order found' device names from the SCSI

Nope.

Historically it depended whether you had a PATA controller with SATA
bridge, a SATA controller with SATA drives, a PATA controller with PATA
drives or a SATA controller with PATA bridge.

Often the bridges are on the card or mainboard. So some VIA systems would
historically use /dev/hda for the first SATA device.

Even more fun is stuff like Jmicron where the BIOS settings determined
whether PATA or SATA was /dev/hda

Alan

2007-10-17 00:00:43

by Rob Landley

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Tuesday 16 October 2007 5:28:59 am Alan Cox wrote:
> > I'm sure somebody will eventually write an OLS paper or something on the
> > advisability of making swapping decisions with 4k granularity when disks
> > really want bigger I/O transactions.
>
> Funnily enough someone thought of that many years ago. They even added
> and documented it, then they made it adjustable.
>
> See the vm section of Documentation/filesystems/proc.txt

I presume you refer to:

page-cluster
------------

page-cluster controls the number of pages which are written to swap in
a single attempt. The swap I/O size.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.

The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

I didn't know that controlled whether the pages were contiguous (or written to
contiguous locations in swap). I thought it was just how many the VM tried
to free at a time.

Still, worth a tweak. Thanks.
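
For reference, the logarithmic encoding quoted above works out as follows. This is only a sketch; the 4 KiB page size is an assumption (it matches the "4k granularity" mentioned earlier in the thread), and the helper names are made up:

```python
PAGE_SIZE_KIB = 4  # assumed hardware page size


def pages_per_swap_io(page_cluster: int) -> int:
    """page-cluster is a base-2 logarithm: 0 -> 1 page, 1 -> 2, 3 -> 8."""
    return 1 << page_cluster


def read_page_cluster(path: str = "/proc/sys/vm/page-cluster") -> int:
    """Read the current setting from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())


for value in range(6):
    pages = pages_per_swap_io(value)
    print(f"page-cluster={value}: {pages} pages, {pages * PAGE_SIZE_KIB} KiB per swap I/O")
```

Writing a new number to the same procfs file (as root) changes the setting.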

> Alan

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-17 05:35:53

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Mon, 15 Oct 2007 03:04:00 CDT, Rob Landley said:
> I note that the eth0 and eth1 names are dynamically assigned on a first come
> first serve basis (like scsi). This never causes me a problem because the
> driver loading order is constant, and once you figure out that eth0 is
> gigabit and eth1 is the 80211g it _stays_ that way across reboots, reliably.
> Yeah, it's a heuristic. Hands up everybody relying on such a heuristic in
> the real world.

I've gotten burned by that heuristic enough times to not rely on it. My last
laptop had an ethernet on the motherboard, a *separate* ethernet in the docking
station, an ethernet on a multifunction pcmcia card (I usually just used the
modem side), and a wireless that looked like an ethernet - so it was possible
for a given interface to be eth1 (if no dock and no pcmcia card) or eth3 (if
both were present). And that's on a laptop from almost 5 years ago.

And then there's the recent Sun and Dell 1U rack-mounts that have 4 ethernets
on the motherboard, and they *never* seem to assign in a 0,1,2,3 order that
matches the 0 1 2 3 printed above the 4 RJ45's ;)

So I have for years been a proponent of 'ethN is nailed by MAC address' :)


2007-10-17 06:04:28

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Wed, 17 Oct 2007, [email protected] wrote:

> On Mon, 15 Oct 2007 03:04:00 CDT, Rob Landley said:
>> I note that the eth0 and eth1 names are dynamically assigned on a first come
>> first serve basis (like scsi). This never causes me a problem because the
>> driver loading order is constant, and once you figure out that eth0 is
>> gigabit and eth1 is the 80211g it _stays_ that way across reboots, reliably.
>> Yeah, it's a heuristic. Hands up everybody relying on such a heuristic in
>> the real world.
>
> I've gotten burned by that heuristic enough times to not rely on it. My last
> laptop had an ethernet on the motherboard, a *separate* ethernet in the docking
> station, an ethernet on a multifunction pcmcia card (I usually just used the
> modem side), and a wireless that looked like an ethernet - so it was possible
> for a given interface to be eth1 (if no dock and no pcmcia card) or eth3 (if
> both were present). And that's on a laptop from almost 5 years ago.
>
> And then there's the recent Sun and Dell 1U rack-mounts that have 4 ethernets
> on the motherboard, and they *never* seem to assign in a 0,1,2,3 order that
> matches the 0 1 2 3 printed above the 4 RJ45's ;)
>
> So I have for years been a proponent of 'ethN is nailed by MAC address' :)

on the other hand, I have two systems in my lab with identical hardware,
loaded with the same OS image, but one calls the interfaces eth0, eth1,
eth2 while the other calls them eth12, eth13, eth14 because it had three
quad cards installed in it for a few days several months ago.

also think what happens to a system if you replace a failed NIC with a
card that is identical except for the MAC address. instead of everything just
working as before, you now have new ethX devices and are missing the old
ethX devices.

both ways of doing things can yield nonsense results in cases where the
other one gives perfectly useable results.
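
Both of the cases above fall out of a very small model of 'nailed by MAC address' naming. This is a toy sketch, not how udev is implemented; the class name and the MAC addresses are made up for the example:

```python
class PersistentNetNames:
    """Toy model: every MAC keeps its name forever, and a MAC that has
    not been seen before gets the next index never handed out."""

    def __init__(self) -> None:
        self.by_mac: dict[str, str] = {}

    def name_for(self, mac: str) -> str:
        if mac not in self.by_mac:
            self.by_mac[mac] = f"eth{len(self.by_mac)}"
        return self.by_mac[mac]

    def boot(self, macs_present: list[str]) -> list[str]:
        return [self.name_for(mac) for mac in macs_present]


onboard = ["00:00:00:00:00:01", "00:00:00:00:00:02", "00:00:00:00:00:03"]
quad_ports = [f"02:00:00:00:01:{port:02x}" for port in range(12)]

# Box A never had the quad cards installed.
box_a = PersistentNetNames()
print(box_a.boot(onboard))

# Box B had three quad cards for a few days; their names stay reserved.
box_b = PersistentNetNames()
box_b.boot(quad_ports + onboard)
print(box_b.boot(onboard))

# Box A after its third NIC is replaced by an identical card with a new MAC.
replaced = onboard[:2] + ["00:00:00:00:00:99"]
print(box_a.boot(replaced))
```

Identical hardware ends up with different names, and a like-for-like replacement shows up as a new interface while the old name goes missing.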

nobody is arguing that the ability to nail things down by MAC address
(or drives by UUID) should be removed. we're just arguing that the option
to get usable, consistent names from hardware that is consistent is being
removed and that it shouldn't be; it has its place just like the 'best
effort' naming.

David Lang

2007-10-17 09:48:57

by Gabor Gombas

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Tue, Oct 16, 2007 at 01:55:07PM -0700, [email protected] wrote:

> why is this any different from the external enclosures? they have always
> appeared as the type of device that connects them to the motherboard, (and
> even with SCSI, there are some controllers that don't generate sdX devices)

In the past enclosures supported only one kind of connector so this
assumption was fine. But nowadays an external disk may have several
connectors (like USB, Firewire and eSATA). Why should the disk's name
depend on which type of cable I managed to grab first? It is the
_same_ disk regardless of the cable type.

There is one thing, however, that could be improved: renaming a disk in a
udev rule should propagate the new name back to the kernel, just like
renaming an ethernet interface does. That way mapping error messages to
physical disk locations could be made much easier.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2007-10-17 17:25:35

by Stefan Richter

[permalink] [raw]
Subject: Re: What still uses the block layer?

Gabor Gombas wrote:
> On Tue, Oct 16, 2007 at 01:55:07PM -0700, [email protected] wrote:
>> why is this any different from the external enclosures? they have always
>> appeared as the type of device that connects them to the motherboard, (and
>> even with SCSI, there are some controllers that don't generate sdX devices)
>
> In the past enclosures supported only one kind of connector so this
> assumption was fine. But nowadays an external disk may have several
> connectors (like USB, Firewire and eSata). Why should the disk's name
> depend on what type of cable did I manage to grab first? It is the
> _same_ disk regardless of the cable type.

Yes, but even udev won't give you one and the same symlink to the disk's
device file then.* There isn't a persistent unique target/unit property
which all of these transports have in common.

The only thing that could be common in the best case is the symlink to
the partition's device file, based on filesystem UUID or filesystem label.

*) unless you write your own rule specific to this one particular enclosure
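
A rough sketch of how those filesystem-UUID symlinks can be inspected, assuming a udev setup that populates /dev/disk/by-uuid (the function name is made up; the directory is a parameter so the sketch can be pointed elsewhere):

```python
import os


def devices_by_uuid(directory: str = "/dev/disk/by-uuid") -> dict[str, str]:
    """Map each filesystem UUID to the device node its symlink resolves to."""
    mapping = {}
    for uuid in sorted(os.listdir(directory)):
        link = os.path.join(directory, uuid)
        if os.path.islink(link):
            mapping[uuid] = os.path.realpath(link)
    return mapping


# On a system with that directory:
# for uuid, node in devices_by_uuid().items():
#     print(uuid, "->", node)
```

The UUID side of the mapping stays the same whichever cable is used; only the device node it resolves to changes.
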
--
Stefan Richter
-=====-=-=== =-=- =---=
http://arcgraph.de/sr/

2007-10-17 21:01:29

by David Lang

[permalink] [raw]
Subject: Re: What still uses the block layer?

On Wed, 17 Oct 2007, Gabor Gombas wrote:

> On Tue, Oct 16, 2007 at 01:55:07PM -0700, [email protected] wrote:
>
>> why is this any different from the external enclosures? they have always
>> appeared as the type of device that connects them to the motherboard, (and
>> even with SCSI, there are some controllers that don't generate sdX devices)
>
> In the past enclosures supported only one kind of connector so this
> assumption was fine. But nowadays an external disk may have several
> connectors (like USB, Firewire and eSata). Why should the disk's name
> depend on what type of cable did I manage to grab first? It is the
> _same_ disk regardless of the cable type.

the right type for the type of cable you choose to use. yes it's the same
disk, but by choosing to hook it up in a different way you get different
results from it (different performance, different predictability)

again, if you want to have a udev rule that then maps these different names
onto the same name, more power to you, but why do you insist on making
_everyone_ work that way (or go to significant extra effort to find the
info in the changing directory structure of sysfs to track down the info
that you throw away)?

> There is one thing however that could be improved: renaming a disk in an
> udev rule should propagate the new name back to the kernel, just like
> renaming an ethernet interface does. That way mapping error messages to
> physical disk locations could be made much easier.

definitely.

David Lang

2007-10-17 23:35:38

by Bill Davidsen

[permalink] [raw]
Subject: Re: What still uses the block layer?

Jeff Garzik wrote:

>> But again, please remember that these USB devices are really SCSI
>> devices. Same for SATA devices. There is a reason they are using the
>> SCSI layer, and it isn't just because the developers felt like it :)
>
> /somewhat/ true I'm afraid: libata uses the SCSI layer for ATAPI
> devices because they are essentially bridges to SCSI devices. It uses
> the SCSI layer for ATA devices because the SCSI layer provided a huge
> amount of infrastructure that would need to have been otherwise
> duplicated, /then/ massaged into coordinating between <jgarzik's ATA
> layer> and <SCSI layer> when dealing with ATAPI.
>
> There is also a detail that was of /huge/ value when introducing a new
> device class: distro installers automatically work, if you use SCSI. If
> you use a new block device type, that behaves differently from other
> types and is on a different major, you have to poke the distros into
> action or do it yourself.
>
> IOW, it was the high Just Works(tm) value of the SCSI layer when it came
> to ATA (not ATAPI) devices.
>
> For the future, ATA will eventually be more independent (though the SCSI
> simulator will be available as an option, for compat), but the value is
> big enough to put that task on the back-burner.
>
I remember being told that I didn't understand the problem when I
suggested using ide-scsi for everything and just hiding the transport. I
get great pleasure from having been (mostly) right on that one. I still
have old systems running ZIP drives as scsi...

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-10-18 13:07:45

by Rogier Wolff

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Tue, Oct 16, 2007 at 05:34:15PM +1000, Nick Piggin wrote:
> > It's a hard call. The I/O time for 1MB of contiguous disk data
> > is about the I/O time of 512 bytes of contiguous disk data.
>
> And if you're thrashing, then by definition you need to throw
> out 1MB of your working set in order to read it in.

Right. But you need a differential hit rate of only a few percent on
that 1020 extra kb of data you swapped in versus the 1Mb of data you
swapped out for this to be advantageous.

With "differential hit rate" I mean the chances of getting a hit on
the 1Mb of data just paged in, minus the chances of getting a hit on
the 1Mb of data just paged out.

With a little luck that 1Mb that is paged out didn't get used for
quite a while, while there is a hint that the 1Mb you're paging in
is active, as one of its sub-pages just got a hit.
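
One simple way to put numbers on this, as a back-of-the-envelope sketch only: the 10 ms positioning time and 50 MB/s transfer rate are assumed figures, not measurements, and the model ignores the cost of writing out the evicted chunk.

```python
def io_time_ms(size_bytes: int, seek_ms: float, bandwidth_mb_s: float) -> float:
    """One contiguous transfer: positioning cost plus streaming cost."""
    return seek_ms + size_bytes / (bandwidth_mb_s * 1e6) * 1e3


def break_even_hit_rate(cluster_bytes: int, page_bytes: int,
                        seek_ms: float, bandwidth_mb_s: float) -> float:
    """Differential hit rate on the extra pages at which the big read pays off.

    Every hit on a prefetched page saves one single-page fault later;
    the price is the extra transfer time of the larger read.
    """
    single = io_time_ms(page_bytes, seek_ms, bandwidth_mb_s)
    cluster = io_time_ms(cluster_bytes, seek_ms, bandwidth_mb_s)
    extra_pages = cluster_bytes // page_bytes - 1
    faults_to_save = (cluster - single) / single
    return faults_to_save / extra_pages


rate = break_even_hit_rate(1024 * 1024, 4096, seek_ms=10.0, bandwidth_mb_s=50.0)
print(f"break-even differential hit rate: {rate:.2%}")
```

With these assumed figures the break-even point is small, which is the shape of the argument above; slower seeks or faster streaming push it lower still.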

So... IMHO, it would be useful to implement something that pages out
chunks of memory larger than a single hardware page. This would reduce
the size of the memory management tables (*), as well as improve disk
throughput if things DO come to paging....

This should of course be configurable. Some workloads are better off
with a virtual page size of 8k, some with 128k, some with 1M.

As far as I can see, the "page-cluster" parameter defines how many
pages are selected for page-out at a time. This increases
the page-out efficiency. Improving the page-in efficiency is also
useful: it is the other half of the equation.

Roger.


(*) If the kernel starts working with a 1Mb virtual page size, you
need a 256 times smaller mapping table between processes and memory or
swap. Of course, the hardware doesn't support this (actually, it does
for 1Mb virtual pages), so you'll have to create 256 page table
entries for the hardware instead of just one.
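
The factor of 256 is just the ratio of the two page sizes, assuming 4k hardware pages. A quick sketch of the arithmetic (the 1 GiB process size is an arbitrary example, and the names are made up):

```python
HARDWARE_PAGE = 4 * 1024      # assumed 4 KiB hardware page
VIRTUAL_PAGE = 1024 * 1024    # the proposed 1 MiB virtual page


def mapping_entries(region_bytes: int, page_bytes: int) -> int:
    """Entries needed to map a region, rounding up to whole pages."""
    return -(-region_bytes // page_bytes)


process_size = 1024 * 1024 * 1024  # example: a process using 1 GiB

small = mapping_entries(process_size, HARDWARE_PAGE)
large = mapping_entries(process_size, VIRTUAL_PAGE)
print(small, large, small // large)
```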



--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2007-10-19 06:49:45

by Rob Landley

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Thursday 18 October 2007 8:00:49 am Rogier Wolff wrote:
> So... IMHO, it would be useful to implement something that pages out
> chunks of memory larger than a single hardware page. This would reduce
> the size of the memory management tables (*), as well as improve disk
> throughput if things DO come to paging....

I believe that was more or less the topic of this paper:
http://kernel.org/doc/ols/2006/ols2006v2-pages-73-78.pdf

Although these seem sort of tangentially related:
http://kernel.org/doc/ols/2006/ols2006v1-pages-369-384.pdf
http://kernel.org/doc/ols/2006/ols2006v2-pages-125-130.pdf

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

2007-10-19 07:22:01

by Rogier Wolff

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

On Fri, Oct 19, 2007 at 01:49:31AM -0500, Rob Landley wrote:
> On Thursday 18 October 2007 8:00:49 am Rogier Wolff wrote:
> > So... IMHO, it would be useful to implement something that pages out
> > chunks of memory larger than a single hardware page. This would reduce
> > the size of the memory management tables (*), as well as improve disk
> > throughput if things DO come to paging....
>
> I believe that was more or less the topic of this paper:
> http://kernel.org/doc/ols/2006/ols2006v2-pages-73-78.pdf

Not really. They are talking about doing this for the page
cache. That's where filesystem files are cached in memory. I'm talking
about the memory that programs use while they are running.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2007-10-20 09:49:20

by Pavel Machek

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Hi!

> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> *shrug* It might. I was a letting it run hoping it would complete itself when

sysrq-f, IIRC.

> it locked solid. (The keyboard LEDs weren't flashing, so I don't _think_ it
> paniced. I was in X so I wouldn't have seen a message...)
>
> (To be honest, I can never remember how to trigger sysrq on a laptop keyboard.
> Presumably X won't intercept it the way it does alt-f1 and ctrl-alt-del...)

sysrq works even in X, and should be pressable on today's laptop
keyboards...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-10-20 09:50:41

by Pavel Machek

[permalink] [raw]
Subject: Re: OOM killer gripe (was Re: What still uses the block layer?)

Hi!

> I suppose I should just configure suspending to a file instead of a
> swap partition, but I've just historically trusted suspend/resume to a
> swap partition much more than to a file. Or maybe I should hack in a
> sysctl to prevent any swapping even though the swap partition is
> configured (so only suspend/resume will use it).

swapon -a; swsusp; swapoff -a?

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html