2002-09-01 23:39:22

by NeilBrown

Subject: PATCH - change to blkdev->queue calling triggers BUG in md.c


Changeset 1.573 (just prior to 2.5.33 release) changed the calling
sequence for blk_dev[major].queue so that it is now called before the
bd_op->open function is called.
This triggers a BUG in md.c which checked that the device was open
whenever ->queue was called. Patch below removes the BUG.

I'm actually a little disappointed by this change. I was hoping that
the ->queue might get changed to be passed a 'struct block_device *'
instead of a 'kdev_t' so that the device driver would only have to
interpret the device number in one place: the open. But now that
->queue is called before ->open, that wouldn't help.

I don't suppose it would make sense to do the default:
	if (!bdev->bd_queue) {
		struct blk_dev_struct *p = blk_dev + major(dev);
		bdev->bd_queue = &p->request_queue;
	}
bit where it is now, and leave the:
	if (p->queue)
		bdev->bd_queue = p->queue(dev);

bit until after the open? It would keep floppy happy, and make me
happy too, but I'm not sure that it is actually 'right'...
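
The ordering suggested above can be sketched in userspace C. This is only a rough model under stated assumptions: the structs are simplified stand-ins for the 2.5-era kernel types, and do_open_sketch(), demo_queue_hook(), major_of() and open_count are hypothetical names that do not exist in the kernel.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumptions, not the real layout). */
struct request_queue { int id; };

struct blk_dev_struct {
    struct request_queue request_queue;        /* default per-major queue */
    struct request_queue *(*queue)(int dev);   /* optional per-device hook */
};

struct block_device {
    struct request_queue *bd_queue;
};

#define MAX_MAJOR 4
static struct blk_dev_struct blk_dev[MAX_MAJOR];
static int open_count;                         /* stands in for "device is open" */

static int major_of(int dev) { return dev >> 8; }

/* Hypothetical driver hook: only hands out its queue once the device is open. */
static struct request_queue private_queue = { 42 };
static struct request_queue *demo_queue_hook(int dev)
{
    (void)dev;
    return open_count > 0 ? &private_queue : NULL;
}

/* Phase 1, before ->open: install only the default queue. */
static void assign_default_queue(struct block_device *bdev, int dev)
{
    if (!bdev->bd_queue)
        bdev->bd_queue = &blk_dev[major_of(dev)].request_queue;
}

/* Phase 2, after ->open: let the driver override the queue. */
static void assign_driver_queue(struct block_device *bdev, int dev)
{
    struct blk_dev_struct *p = &blk_dev[major_of(dev)];
    if (p->queue)
        bdev->bd_queue = p->queue(dev);
}

static int do_open_sketch(struct block_device *bdev, int dev)
{
    assign_default_queue(bdev, dev);  /* a queue exists for I/O done during open */
    open_count++;                     /* stands in for bd_op->open() succeeding */
    assign_driver_queue(bdev, dev);   /* the hook may now assume the device is open */
    return 0;
}
```

With this split, a driver without a ->queue hook keeps the default queue throughout, while a driver with a hook is only asked for its queue after the open has completed.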

Anyway, here is the patch that stops md from BUGging out.

NeilBrown

### Comments for ChangeSet
Remove BUG in md.c that change in 2.5.33 triggers.

Since 2.5.33, the blk_dev[].queue is called without
the device open, so md_queue_proc can no longer assume
that the device is open.


----------- Diffstat output ------------
./drivers/md/md.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

--- ./drivers/md/md.c 2002/09/01 23:27:10 1.1
+++ ./drivers/md/md.c 2002/09/01 23:28:27 1.2
@@ -3157,11 +3157,11 @@ request_queue_t * md_queue_proc(kdev_t d
 {
 	mddev_t *mddev = mddev_find(minor(dev));
 	request_queue_t *q = BLK_DEFAULT_QUEUE(MAJOR_NR);
-	if (!mddev || atomic_read(&mddev->active)<2)
-		BUG();
-	if (mddev->pers)
-		q = &mddev->queue;
-	mddev_put(mddev); /* the caller must hold a reference... */
+	if (mddev) {
+		if (mddev->pers)
+			q = &mddev->queue;
+		mddev_put(mddev);
+	}
 	return q;
 }
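
The behaviour after the patch can be modelled in userspace C as follows. This is a minimal sketch, not the md driver: the struct layout, md_table, default_queue and every *_sketch function are hypothetical stand-ins, and the lookup is assumed to be allowed to return NULL.

```c
#include <stddef.h>

struct request_queue { int id; };

/* Simplified stand-in for mddev_t (assumption, not the real layout). */
struct mddev {
    int refcount;
    void *pers;                    /* non-NULL once a personality is running */
    struct request_queue queue;
};

#define MAX_MD 4
static struct mddev *md_table[MAX_MD];
static struct request_queue default_queue;   /* plays BLK_DEFAULT_QUEUE(MAJOR_NR) */

/* Lookup that takes a reference; may return NULL in this model. */
static struct mddev *mddev_find_sketch(int minor)
{
    struct mddev *m = (minor >= 0 && minor < MAX_MD) ? md_table[minor] : NULL;
    if (m)
        m->refcount++;
    return m;
}

static void mddev_put_sketch(struct mddev *m)
{
    m->refcount--;
}

/* No BUG(): an unopened or inactive array simply gets the default queue. */
static struct request_queue *md_queue_proc_sketch(int minor)
{
    struct mddev *mddev = mddev_find_sketch(minor);
    struct request_queue *q = &default_queue;

    if (mddev) {
        if (mddev->pers)
            q = &mddev->queue;
        mddev_put_sketch(mddev);   /* drop the reference taken by the lookup */
    }
    return q;
}
```

The point of the sketch is that the function no longer asserts anything about the open count; it only decides which queue to hand back.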


2002-09-02 04:00:54

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Neil Brown wrote:
>
> I'm actually a little disappointed by this change. I was hoping that
> the ->queue might get changed to be passed a 'struct block_device *'
> instead of a 'kdev_t' so that the device driver would only have to
> interpret the device number in one place: the open. But now that
> ->queue is called before ->open, that wouldn't help.

We may still do this.

Right now the _only_ reason to call ->queue before open() is that open()
is also doing things like disk change checking, which reasonably needs the
queue because it can need to do IO in order to check the disk change
status. The floppy in fact did exactly this.

HOWEVER, that disk change checking really should be done by the generic
layers, and it should be done after the open() anyway (and not by the
open), and I think Al is actually working on this. That will allow us to
be a bit more flexible about the ordering.

Linus

2002-09-02 08:49:32

by Andries E. Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> HOWEVER, that disk change checking really should be done by
> the generic layers, and it should be done after the open() anyway
> (and not by the open)

Are you sure?
I am inclined to think that this would be an undesirable change of
open() semantics. Traditionally, and according to all standards,
open() will return ENXIO when the device does not exist.

Andries

2002-09-02 16:49:11

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Andries Brouwer wrote:
> > HOWEVER, that disk change checking really should be done by
> > the generic layers, and it should be done after the open() anyway
> > (and not by the open)
>
> Are you sure?
> I am inclined to think that this would be an undesirable change of
> open() semantics. Traditionally, and according to all standards,
> open() will return ENXIO when the device does not exist.

Well, one reason I don't want the low-level drivers doing the media change
checking is that there's more to media change than just checking the
media.

For example, the higher levels want to do a partition table re-read if the
media really has changed. We do have this strange "bd_invalidated" thing
for passing that information back, and maybe that is acceptable. It's a
bit subtle, though.

Another reason why it would be good to factor out media change from open()
is that I can well imagine that somebody would want to do a "door open"
ioctl on a device without a media, and we actually do kind of have that
interface: opening with O_NDELAY historically means to not do the media
change checks.

And guess what? Because that test is done inside the low-level driver
right now, it means that these O_NDELAY semantics aren't actually known or
followed by most drivers, _and_ it means that the higher levels don't even
realize that sometimes the media check hasn't gotten done at all (ie
because the low-level "open()" is called only for the _first_ open, the
higher levels right now won't even call "open()" at _all_ later on and so
the media checks aren't done later when they should be).

However, your ENXIO point is a good one, and implies that we really should
have a more expressive "media_change()" function, so that if we'd factor
out open()/media_check(), then we'd still get the right ENXIO thing.
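
One way to picture such a split is the userspace sketch below. It is only an illustration under assumptions: enum media_status, struct disk_sketch, demo_media_check() and generic_open_sketch() are hypothetical names, and O_NONBLOCK is used for the flag because on Linux O_NDELAY is typically an alias for it.

```c
#include <errno.h>
#include <fcntl.h>

/* A more expressive result than a plain "changed / not changed" bit. */
enum media_status { MEDIA_UNCHANGED, MEDIA_CHANGED, MEDIA_ABSENT };

struct disk_sketch {
    enum media_status (*media_check)(struct disk_sketch *);
    enum media_status state;   /* what the demo driver hook will report */
    int need_rescan;           /* set when the partition table must be re-read */
};

/* Hypothetical low-level hook: reports the state, decides nothing. */
static enum media_status demo_media_check(struct disk_sketch *d)
{
    return d->state;
}

/* Generic layer: runs on every open, not just the first one. */
static int generic_open_sketch(struct disk_sketch *d, int flags)
{
    if (flags & O_NONBLOCK)    /* O_NDELAY semantics: skip the media checks */
        return 0;

    switch (d->media_check(d)) {
    case MEDIA_ABSENT:
        return -ENXIO;         /* keeps the traditional open() error */
    case MEDIA_CHANGED:
        d->need_rescan = 1;    /* generic code handles the re-read */
        break;
    case MEDIA_UNCHANGED:
        break;
    }
    return 0;
}
```

In this model a "door open" style ioctl can still reach a drive with no media by opening with O_NONBLOCK, while a normal open of an empty drive fails with ENXIO.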

Linus

2002-09-02 20:30:57

by Andries Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Mon, Sep 02, 2002 at 10:01:46AM -0700, Linus Torvalds wrote:

> For example, the higher levels want to do a partition table re-read
> if the media really has changed.

My original setup made a kernel that does not know anything about
partition tables. User space would tell the kernel about partitions
on some block device.

Roughly speaking the impact is that there is a partx invocation
before a mount.

Now it seems Al is doing all the work, so I can just sit back and watch.
But I hope he makes precisely this: a kernel that does not do any
partition reading of its own.

Andries


[Yes, it is fundamentally wrong when the kernel starts guessing.
Guessing filesystem type is bad. Also guessing partition table type
is bad. Moreover, the kernel probing may lead to device problems
and even to kernel crashes, as I last observed two days ago.
Only the user knows what she wants to do with this disk. Format?
Remove OnTrack Disk Manager? There are all kinds of situations
where partition table re-read is directly harmful.]

2002-09-02 20:38:25

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Andries Brouwer wrote:
>
> Now it seems Al is doing all the work, so I can just sit back and watch.
> But I hope he makes precisely this: a kernel that does not do any
> partition reading of its own.

I disagree, if only because of backwards compatibility issues.

On a conceptual level I think you're right. However, it will break too
many standard installations as is.

If/when we have a reasonable initrd setup that is usable, we could do some
automatic partitioning of devices that are available at bootup to minimize
the impact, but I don't think it is realistic otherwise.

Linus

2002-09-02 21:22:40

by Andries E. Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> > But I hope he makes precisely this: a kernel that does not do any
> > partition reading of its own.
>
> I disagree, if only because of backwards compatibility issues.
>
> On a conceptual level I think you're right. However, it will break too
> many standard installations as is.
>
> If/when we have a reasonable initrd setup that is usable, we could do some
> automatic partitioning of devices that are available at bootup to minimize
> the impact, but I don't think it is realistic otherwise.

Compare it with mounting.

It would be very bad if the kernel automatically mounted all
filesystems in sight. So, user space tells what to mount.
But at boot time there is a special situation.
In the end we want to have an initrd that mounts the rootfs,
but today we give kernel command line parameters with
rootfstype= and root=.

In a similar way it is bad that the kernel automatically tries
to interpret some data on a block device as a partition table.
The user can tell the kernel. (Yes, today.)
But at boot time there is a special situation.
In the end we want to have an initrd that does the partition reading,
but now we could give a kernel command line parameter with
rootpttype= and have the kernel only parse the partition table
of the root device.

Andries


[Yes, a shock, but very easy for people to add
blockdev --rereadpt /dev/foo
(or a partx call) in some bootscripts.]
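
For reference, the request that blockdev --rereadpt makes is the BLKRRPART ioctl on the whole-disk device. A minimal sketch of the same call from C follows, assuming a Linux system with <linux/fs.h> available; reread_partition_table() is a hypothetical helper name, and the call normally needs root privileges and a device whose partitions are not in use.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKRRPART */

/* Ask the kernel to re-read the partition table of the disk at `path`.
 * Returns 0 on success, -1 on failure with errno set. */
static int reread_partition_table(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;               /* errno set by open() */

    int rc = ioctl(fd, BLKRRPART);
    int saved_errno = errno;     /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return rc == 0 ? 0 : -1;
}
```

A boot script equivalent would call this once per disk (for example on /dev/foo from the text above) before any mount that refers to a partition.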

[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]

2002-09-02 21:26:52

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Andries Brouwer wrote:
>
> Compare it with mounting.

NO.

The point about backwards compatibility is that things WORK.

There's no point in comparing things to how you _want_ them to work. The
only thing that matters for backwards compatibility is how they work
_today_.

And your suggestion would break every single installation out there. Not
"maybe a few". Every single one.

(yeah, you could find some NFS-only setup that doesn't break. Big deal).

And backwards compatibility is extremely important.

Linus

2002-09-02 21:36:58

by Andries E. Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> The point about backwards compatibility is that things WORK.

Must I conclude that you did not read my entire letter?

Since we started this small detour talking about media change,
let me quote that fragment once more.

"[Don't think that I actually propose doing this today as the default,
but it would be a very small patch to add this as an optional
behaviour. But there is today, and there is the faraway goal.
The faraway goal is: no partition table reading in the kernel.
And that influences designing today what to do on media change.
Already today I would consider it entirely reasonable if there
was no automatic partition table reading after a media change.]"

No, my suggested changes would not break a single Linux installation
in the world.

Andries

2002-09-02 21:39:31

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Mon, 2 Sep 2002, Andries Brouwer wrote:
> [Yes, a shock, but very easy for people to add
> blockdev --rereadpt /dev/foo
> (or a partx call) in some bootscripts.]

fdisk -r DEV -> read device's partition table

> The faraway goal is: no partition table reading in the kernel.

Why not the faraway goal: no partition tables any more? They're annoying.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-02 21:44:27

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Mon, 2 Sep 2002, Linus Torvalds wrote:
> The point about backwards compatibility is that things WORK.
>
> There's no point in comparing things to how you _want_ them to work. The
> only thing that matters for backwards compatibility is how they work
> _today_.
>
> And your suggestion would break every single installation out there. Not
> "maybe a few". Every single one.
>
> (yeah, you could find some NFS-only setup that doesn't break. Big deal).
>
> And backwards compatibility is extremely important.

dep_bool ' New mountalike partitioning code' CONFIG_PARTMOUNTING CONFIG_EXPERIMENTAL CONFIG_WHATEVER

Or, since we're talking about the future:

<bool name="PARTMOUNTING">
<title>
New mount-alike partitioning code
</title>
<dep name="EXPERIMENTAL" sense="include" />
<dep name="WHATEVER" sense="exclude" />
</bool>

See? New Deal is for the ones that were annoyed by the old one.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-02 21:47:56

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Andries Brouwer wrote:
>
> No, my suggested changes would not break a single Linux installation
> in the world.

.. by making your suggested behaviour not be used. Yes.

But if that is the case, then we _still_ need to fix the media change and
partition read issue. Right? Which brings back _all_ my points for why it
should be done at open time, and by the generic routine. Agreed?

Linus

2002-09-02 21:53:35

by Andries Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Mon, Sep 02, 2002 at 03:43:56PM -0600, Thunder from the hill wrote:

> > The faraway goal is: no partition table reading in the kernel.
>
> Why not the faraway goal: no partition tables any more? They're annoying.

As soon as the kernel stops reading partition tables, user space
is entirely free in what it does. One of the possibilities is
then of course: no partition tables.

2002-09-02 22:07:03

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Thunder from the hill wrote:
>
> Why not the faraway goal: no partition tables any more? They're annoying.

Yeah, users and real life is annoying.

Guys, Linux is not a research project. Never was, never will be. If you
want to have a research project that does things the way people think they
should be done (as opposed to real life and being practical), look at Hurd
and look at a lot of other projects. But don't look at Linux.

Partition tables are a fact of life. And they are a fundamental part to
being able to parse what the disk contains.

Sure, you can do it in user space too. And you can do TCP in user space.
But some things are just fairly fundamental to the working of the system.
The disk and filesystem layout is one such thing. It had better "just
work".

Linus

2002-09-02 22:27:30

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Speaking from the perspective of a long time computer user and sys-admin, I'm
trying to understand life without a partition table.

I operate under the following assumptions:

1. It's useful to have a physical disk divided into multiple logical disks.
2. It's therefore important that the bootloader know about them, assuming that
we want to be able to boot from any logical disk.
3. We can either have the bootloader spend time divining the structure of the
logical disks by scanning the physical disk or we can write it down in some
useful place.
4. That useful place is very near the front of the physical disk.

Of course, I'd be the first to admit that the current partition table is a
stupid design, but I can't see not having one at all.


--
We have three rights:
the right to work, the right to pay to work, and the right to suffer the
consequences of our work.
We have three obligations:
the obligation to work, the obligation to pay to work, and the obligation
to suffer the consequences of our work.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-02 22:35:25

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Mon, 2 Sep 2002, Linus Torvalds wrote:
> On Mon, 2 Sep 2002, Thunder from the hill wrote:
> >
> > Why not the faraway goal: no partition tables any more? They're annoying.
>
> Guys, Linux is not a research project.
>
> Partition tables are a fact of life.

Linus, can you spell "faraway"? I wasn't talking about kicking
partitioning code from Linux 2.5, I was talking about inventing a better
way in 2010.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-02 22:53:44

by Oliver Neukum

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Tuesday, 3 September 2002 00:33, Hacksaw wrote:
> Speaking from the perspective of a long time computer user and sys-admin,
> I'm trying to understand life without a partition table.
>
> I operate under the following assumptions:
>
> 1. It's useful to have a physical disk divided into multiple logical disks.

It's not only useful, without it there can be no cooperation among
operating systems. There are standards which have to be followed.

Partition detection has to work always and everywhere.
It has to work if you have booted into /bin/bash and half your
disk is gone and you're busily hacking away with a disk editor.

[..]
> Of course, I'd be the first to admit that the current partition table is a
> stupid design, but I can't see not having one at all.

It's not the only design.

Regards
Oliver

2002-09-02 23:03:57

by Andries Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Mon, Sep 02, 2002 at 03:00:27PM -0700, Linus Torvalds wrote:

> > No, my suggested changes would not break a single Linux installation
> > in the world.
>
> .. by making your suggested behaviour not be used. Yes.

Not so pessimistic. We go by small steps.

I think it important to get rid of partition table reading in the kernel.

It (pt reading) is wrong in principle, as we agree already.
But there are also all kinds of practical reasons.

One argument is that our traditional DOS-type partition table will soon
be at the end of its useful life. Yes, maybe it survives a few more years
but our own stability requires slow changes, so we must start thinking a
long time in advance.
Another argument is that it sometimes takes a *long* time, like several
minutes, especially when this reading triggers hardware bugs.
Another argument is that nobody knows whether there is a partition table.
In the case of ZIP drives there sometimes is a jumper or special SCSI command
to switch between the "large floppy" and "removable disk" statuses, and
the kernel doesn't know.
Another argument is that tricky things happen in the presence of disk managers.

So stage one is a kernel boot parameter "nopt" or so, that stops parsing
of partition tables other than the root partition. Some people need it
because of special problems, others just want to experiment. That is good,
and we'll get some feedback on partx and family.

Stage two happens a year later, when we have a working initrd. Seen from the
outside the new (kernel + initrd) plays the role of the old kernel.
Ha. That means that we can move the pt reading to initrd, and nobody notices.

Stage three happens when initrd and kernel no longer are so tightly coupled.
Initrd is just early userspace, tools exist to populate it, distributions make
their own. Now the kernel does not need any partition reading code and
nobody ever noticed. And the setup has become much more powerful.
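
Stage one amounts to a very small policy check. The sketch below models it in userspace C; "nopt" is the parameter name floated above, while cmdline_has_word() and should_scan_at_boot() are hypothetical helpers and not existing kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `word` appears as a complete space-separated token in `cmdline`. */
static bool cmdline_has_word(const char *cmdline, const char *word)
{
    size_t n = strlen(word);
    const char *p = cmdline;

    while (*p) {
        while (*p == ' ')
            p++;                          /* skip separators */
        const char *start = p;
        while (*p && *p != ' ')
            p++;                          /* advance to the end of the token */
        if ((size_t)(p - start) == n && strncmp(start, word, n) == 0)
            return true;
    }
    return false;
}

/* Stage one: with "nopt", only the root disk has its table parsed at boot. */
static bool should_scan_at_boot(const char *cmdline, bool is_root_disk)
{
    if (!cmdline_has_word(cmdline, "nopt"))
        return true;                      /* today's behaviour: scan everything */
    return is_root_disk;
}
```

Without the parameter nothing changes, which is what keeps existing installations working; with it, every non-root disk waits for an explicit request from user space.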

-----

> But if that is the case, then we _still_ need to fix the media change and
> partition read issue. Right? Which brings back _all_ my points for why it
> should be done at open time, and by the generic routine. Agreed?

The above was mainly about the partition reading at boot time.
There are two other situations: partition reading at insmod time,
and partition reading at media change time.

But these are easier situations. There is a functioning userspace already.
As I said, in view of the desired direction, I would not mind at all if
a media change did not trigger partition reading today.
(In fact, for me, under 2.5.33, it doesn't. But blockdev --rereadpt helps.)


Andries

2002-09-02 23:15:36

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Tue, 3 Sep 2002, Andries Brouwer wrote:
>
> I think it important to get rid of partition table reading in the kernel.

Why?

> It (pt reading) is wrong in principle, as we agree already.

No, we don't agree.

I see that some people would like to remove it from the kernel, and I'm
not violently opposed to it if it can be done without breaking existing
behaviour.

But I do _not_ see any really fundamental reason why the kernel shouldn't
parse the partition tables. I see a lot of problems if the kernel were to
stop, and I don't see a lot of advantages to not doing so.

> But there are also all kinds of practical reasons.
>
> One argument is that our traditional DOS-type partition table will soon
> be at the end of its useful life. Yes, maybe it survives a few more years
> but our own stability requires slow changes, so we must start thinking a
> long time in advance.

That's a bad argument. It's not as if we want to have random formats for
this thing. Partitioning is damn important, and it has to be portable
across different machines and different operating systems. That all means
that there is absolutely _zero_ incentive to make up a partition format of
our own, since there are perfectly fine and existing formats.

> Another argument is that it sometimes takes a *long* time, like several
> minutes, especially when this reading triggers hardware bugs.

This is only an argument for doing it on demand, not for dropping it.

> Another argument is that nobody knows whether there is a partition table.
> In the case of ZIP drives there sometimes is a jumper or special SCSI command
> to switch between the "large floppy" and "removable disk" statuses, and
> the kernel doesn't know.
> Another argument is that tricky things happen in the presence of disk managers.

And none of these work any better in user space.

> > But if that is the case, then we _still_ need to fix the media change and
> > partition read issue. Right? Which brings back _all_ my points for why it
> > should be done at open time, and by the generic routine. Agreed?
>
> The above was mainly about the partition reading at boot time.
> There are two other situations: partition reading at insmod time,
> and partition reading at media change time.
>
> But these are easier situations. There is a functioning userspace already.

You seem to think that kernel space somehow cannot do something that user
space can. I just don't see the overriding problems you claim.

Linus

2002-09-02 23:21:40

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

>It's not only useful, without it there can be no cooperation among
>operating systems. There are standards which have to be followed.

Right. Of course, now we know that Thunder didn't mean to get rid of all
partition tables, but just the stupid one we are saddled with.

I wonder if there is a working group anywhere for this topic. It would affect
every OS and all the BIOS shops.
--
Listening changes what we are listening to.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-03 00:48:52

by Andries E. Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> > I think it important to get rid of partition table reading in the kernel.
>
> Why?

Let me be more precise.
I think it important to get rid of automatic partition table reading
in the kernel.

Why?
Because in some cases it is undesirable.
Because in some cases it crashes the kernel.
Because it involves guessing and heuristics.
Because policy belongs in user space.


> > One argument is that our traditional DOS-type partition table will
> > soon be at the end of its useful life. Yes, maybe it survives
> > a few more years but our own stability requires slow changes,
> > so we must start thinking a long time in advance.
>
> That's a bad argument. It's not as if we want to have random formats for
> this thing. Partitioning is damn important, and it has to be portable
> across different machines and different operating systems. That all means
> that there is absolutely _zero_ incentive to make up a partition format of
> our own, since there are perfectly fine and existing formats.

That is a separate discussion best left for some other time.
[But every OS has its own partition table type, and the types
are not compatible. We started using the DOS-type partition table.
But it is dying. Windows replaces it with their dynamic disks.
What do we do? Follow Microsoft? Pick the Plan9 format?]

> > Another argument is that it sometimes takes a *long* time, like several
> > minutes, especially when this reading triggers hardware bugs.
>
> This is only an argument for doing it on demand, not for dropping it.

Yes - that is my main point: doing it on demand. On demand only.

> > Another argument is that nobody knows whether there is
> > a partition table. (ZIP: "large floppy" vs "removable disk")
> > Another argument is that tricky things happen with disk managers.
>
> And none of these work any better in user space.

Well, in fact they do.

The user knows whether she treats her ZIP like a removable disk
or like a big floppy, that is, whether she should ask or refrain
from asking to read the pt.

And yes, if the partitions on the disk are to be shifted by 63 sectors
then partx can notice that and tell the kernel. But if the kernel does
these things automatically it can be difficult to remove Disk Manager.


> You seem to think that kernel space somehow cannot do something that
> user space can. I just don't see the overriding problems you claim.

It is the user who knows and wants to decide.

If my disk has media errors and I want to rescue what still can be read,
then I am very unhappy that the kernel starts reading the first sector
and the last sector and various sectors in the middle.
I want to have very precise control over what I/O happens.

If I insert a SmartMedia card then I know very precisely that it has
a FAT filesystem, a special one. Some cameras will refuse to read
such cards formatted by DOS. If the kernel starts probing, as it does
today, then it will read the first sector and the last sector, etc.
But my reader has a firmware bug, an off-by-one mistake in the reported
capacity, and the kernel tries to read a sector past the end of the card,
gets an error and the SCSI code starts retrying, resetting the device,
the host, the bus, finally takes the device offline. In the meantime
the USB code is entirely confused by aborts and crashes the kernel.
Of course both SCSI and USB code have to be improved, but it would
certainly be nice if I could tell from userspace: probe only for FAT.
No need at all to read this last sector.

I have seen partition tables with a loop. They would poison Linux
so that it was impossible to boot Linux on a system with such a disk.
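
A looped table of this kind can be survived by any parser, in kernel or user space, that refuses to follow a link twice. The following is a minimal sketch under assumptions: the chain is modelled as "sector of the next link, 0 for end", and walk_chain(), table_next() and MAX_LINKS are hypothetical names with an arbitrary cap.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LINKS 128   /* arbitrary cap on chain length (assumption) */

/* Returns the sector of the next link in the chain, or 0 at the end. */
typedef uint32_t (*next_link_fn)(uint32_t sector, void *ctx);

/* Demo reader: `ctx` is an array where table[sector] holds the next sector. */
static uint32_t table_next(uint32_t sector, void *ctx)
{
    return ((const uint32_t *)ctx)[sector];
}

/* Follow the chain starting at `first`.
 * Returns the number of links visited, or -1 on a loop or over-long chain. */
static int walk_chain(uint32_t first, next_link_fn next, void *ctx)
{
    uint32_t seen[MAX_LINKS];
    int count = 0;

    for (uint32_t cur = first; cur != 0; cur = next(cur, ctx)) {
        for (int i = 0; i < count; i++)
            if (seen[i] == cur)
                return -1;          /* this sector was already visited: loop */
        if (count == MAX_LINKS)
            return -1;              /* give up rather than read forever */
        seen[count++] = cur;
    }
    return count;
}
```

The quadratic scan over seen[] is acceptable here because the cap keeps the chain short; the important property is that the walk always terminates.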

I have seen disks with random test data causing Linux to go out and
read nonexistent sectors. There is the real possibility that no
partition table is present, and trying to find one may be a bad idea.

I have seen disks that form part of a multi-disk array.
Often the partition tables are meaningless.


Not doing things automatically gives power to the user.
In some situations this power is needed.


And once this partition reading is done on demand only, it does not
matter very much who does the reading. It may be the kernel.
It may be a user space program.

Andries

2002-09-03 03:53:28

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Mon, 2 Sep 2002, Linus Torvalds wrote:
>
> However, that has nothing to do with whether it is in user space or kernel
> space. In many ways it is _easier_ to do on demand in kernel space: when
> somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
> to be.

Note that this actually allows you to do your own user-space partitioning
if you want to - simply by making sure that you do your partitioning
_before_ somebody tries to open a partition on the device.

And if you look at how fs/block_dev.c looks right now, you'll notice that
we already handle the "main device" vs "sub-partition" cases differently,
so it should be fairly straightforward to eventually do the partitioning
on demand.

We're not there yet, no. But doing it in the open() path of
fs/block_dev.c sure looks like it's the easiest way to maintain sanity wrt
partitioning, _and_ maintain 100% backwards compatibility.

[ Well, the "100% backwards compatibility" is not strictly true. Doing
partition handling on demand will mean that things like /proc/partitions
will obviously also end up being populated on demand, which may break
various sysadmin tools. But at least then it's fairly well localized,
and it's reasonably easy to grep for /proc/partitions in tools to see if
they may care ]
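
The on-demand idea can be modelled as a small state machine. This is a sketch only: struct disk_sketch, enum part_state and the three functions are hypothetical and do not correspond to code in fs/block_dev.c.

```c
/* Who, if anyone, has supplied the partition layout so far. */
enum part_state { PT_UNKNOWN, PT_FROM_KERNEL, PT_FROM_USER };

struct disk_sketch {
    enum part_state state;
    int scans;              /* how many times the kernel read the table */
};

/* Stands in for the kernel's own partition table parsing. */
static void kernel_scan(struct disk_sketch *d)
{
    d->scans++;
    d->state = PT_FROM_KERNEL;
}

/* User space registers its own layout before any partition is opened. */
static void user_set_partitions(struct disk_sketch *d)
{
    d->state = PT_FROM_USER;
}

/* partno 0 is the whole device; anything else is a sub-partition. */
static void open_sketch(struct disk_sketch *d, int partno)
{
    /* Only the first open of a sub-partition on an unscanned disk
     * triggers a scan; opening the flat device never does. */
    if (partno != 0 && d->state == PT_UNKNOWN)
        kernel_scan(d);
}
```

Opening the whole device leaves the disk untouched, the first open of a partition triggers exactly one scan, and a layout supplied by user space beforehand suppresses the scan entirely.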

Linus

2002-09-03 03:43:31

by Linus Torvalds

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On Tue, 3 Sep 2002, Andries Brouwer wrote:
>
> Why?
> Because in some cases it is undesirable.

Again, Why?

You can always use the flat device as-is.

> Because in some cases it crashes the kernel.

But moving it to user space would cause the kernel to crash anyway. Bugs
are bugs.

> Because it involves guessing and heuristics.

The same guesses and heuristics would have to be in user space.

> Because policy belongs in user space.

It's not policy. It's a fact of life that disks need to be split up into
parts, and the partitioning schemes are well-defined and shared across
multiple operating systems.

> Yes - that is my main point: doing it on demand. On demand only.

But I actually _agree_ with this.

However, that has nothing to do with whether it is in user space or kernel
space. In many ways it is _easier_ to do on demand in kernel space: when
somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
to be.

The fact that partitioning right now is to some degree handled by device
drivers is a problem, but that's not a user space vs kernel space issue.
It's slowly getting moved to higher levels.

Linus

2002-09-03 15:17:37

by Andries Brouwer

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Mon, Sep 02, 2002 at 08:55:47PM -0700, Linus Torvalds wrote:

Discussion so far:

(1) It is wrong when the kernel guesses, because it may guess wrong.
Userspace must tell the kernel what to do.

[The mount call is not "mount dev dir" but "mount -t type dev dir".
The kernel could guess, and often guess right, but some types are
close, like ext2 and ext3, or various ufs types, and some types may be
indistinguishable from the disk image, like msdos and vfat, where the
right type may depend on the intentions of the user.]

[In a similar way it is bad when the kernel unprovoked starts trying
to interpret the first few and last few sectors of the disk as an
Acorn, Amiga, Atari, BSD, DOS, EFI, IBM, Mac, Minix, LDM, OSF, SGI,
Sun, Ultrix partition table. Maybe there was no table. Maybe there
is a table of a kind the kernel did not know about, e.g. an AIX or
Plan 9 table, or a newer version of *BSD or Minix while the kernel
only knows about older versions, or ...]


(2) In all kinds of special situations attempts to read a partition
table lead to errors, even to kernel crashes. The kernel should not
unprovoked start doing I/O, guessing where the partition table might be,
and what type it might have.
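
To illustrate why probing is a guess, here is a rough sketch of a DOS-style check on sector 0. It relies on the usual layout (0x55 0xAA in bytes 510 and 511, four 16-byte entries starting at offset 446, each with a status byte of 0x00 or 0x80); the function names are made up for this example and are not the kernel's. Note that a FAT boot sector on an unpartitioned device carries the same signature, so the signature alone settles nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* The boot signature sits in the last two bytes of a 512-byte sector 0. */
static int has_boot_signature(const uint8_t *sector)
{
    assert(sector != NULL);
    return sector[510] == 0x55 && sector[511] == 0xAA;
}

/*
 * Heuristic only: signature present and every entry's status byte is
 * 0x00 (inactive) or 0x80 (active). Other data can pass this check by
 * accident, and a valid table of an unknown kind will fail it.
 */
static int looks_like_dos_table(const uint8_t *sector)
{
    if (!has_boot_signature(sector))
        return 0;
    for (int i = 0; i < 4; i++) {
        uint8_t status = sector[446 + 16 * i];
        if (status != 0x00 && status != 0x80)
            return 0;
    }
    return 1;
}
```

A sector that is all zeros apart from the signature passes this check even though it describes no partitions at all, which is the kind of wrong guess being discussed.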

> > Yes - that is my main point: doing it on demand. On demand only.
>
> But I actually _agree_ with this.
>
> However, that has nothing to do with whether it is in user space or kernel
> space. In many ways it is _easier_ to do on demand in kernel space: when
> somebody opens /dev/sda1 and it isn't partitioned yet, you know it needs
> to be.

At first I misread this sentence ("partitioning" for me is something
done with fdisk) but now I take it to mean: If we have /dev/sda
but have not read its partition table, and somebody opens /dev/sda1,
then we decide that we must read a partition table.

If that is what you mean, I disagree.
(Compare: we have /mnt/cdrom and someone opens /mnt/cdrom/foo, should we
decide to automatically mount /dev/cdrom? An automounter in user space
may do such things. The kernel may not.)

> Note that this actually allows you to do your own user-space partitioning
> if you want to - simply by making sure that you do your partitioning
> _before_ somebody tries to open a partition on the device.

You are inventing a can of worms. Suppose user space already told the
kernel where the partitions are, and the kernel knows about sda1, sda2, sda3.
Now somebody refers to sda4. Does the kernel start reading the device,
possibly changing the meaning of sda1 etc?
What if this disk is part of a RAID?

No, we must slowly migrate to the state where the kernel never takes the
initiative to search for a partition table. That initiative belongs to
user space.

Andries

2002-09-03 17:18:24

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Mon, 2 Sep 2002, Hacksaw wrote:
> 1. It's useful to have a physical disk divided into multiple logical disks.
> 2. It's therefore important that the bootloader know about them, assuming that
> we want to be able to boot from any logical disk.
> 3. We can either have the bootloader spend time divining the structure of the
> logical disks by scanning the physical disk or we can write it down in some
> useful place.
> 4. That useful place is very near the front of the physical disk.

My "visions" go elsewhere:

The users who still need partition tables shall get theirs in a sane way
-- maybe as David Miller proposed, or just simple Sun-styled partition
tables (even though I don't think they're _much_ saner than PC partition
tables).

And the second thing is about your point one. We have two big raid arrays
divided into three racks. Here we have one logical disk divided into many
physical disks. If you want to know the constraints, call the controller
and ask for it. The rest is bogus -- why should I have a partition table?
Maybe divide the raid into smaller disks?!

That's it.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 17:58:19

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

>The users who still need partition tables
My main gripe was my impression that you wanted to do away entirely with
partition tables, which I am now taking as a misread.

I can certainly imagine a few different ways to have partition tables that
make more sense than the typical Wintel version.

>Maybe divide the raid into smaller disks?!

Absolutely, if that is your requirement. I have done this. It gives you the
usefulness of smaller disks with the speed and reliability of the RAID.

More importantly, the hardware should be considered largely immutable. For
reliability reasons, I want the hardware to have its settings in the safest
manner possible, which means not taxing flash ram with too many rewrites. The
place for the logical layout of the disks is in the partition table on the
disk. One reason for this: what if the controller dies? In fact, I'd like the
controller to store its RAID setup on the disk as well. Maybe even on all of
them. Of course, if the partition equals the entire disk, great. The table
will be really short.

In fact, I want a number of backup partition tables (a la backup superblocks).
If you've ever had 70 people waiting to be able to do any work while you try
and restore a disk that had the partition table scribbled on, you appreciate
what I am saying.
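
A sketch of what backup tables could look like, in the spirit of backup superblocks. Everything here is an assumption for illustration: the `struct table_copy` layout, the toy rotate-and-xor checksum, and the function names do not correspond to any existing on-disk format. Each copy carries a checksum, and the reader takes the first copy that still validates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_ENTRIES 4

/* Hypothetical on-disk copy of a partition table. */
struct table_copy {
    uint32_t start[TABLE_ENTRIES];
    uint32_t len[TABLE_ENTRIES];
    uint32_t checksum;
};

static uint32_t rotl5(uint32_t v)
{
    return (v << 5) | (v >> 27);
}

/* Toy checksum; the non-zero seed makes an all-zero copy invalid. */
static uint32_t table_checksum(const struct table_copy *t)
{
    uint32_t sum = 0x9e3779b9u;
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        sum = rotl5(sum) ^ t->start[i];
        sum = rotl5(sum) ^ t->len[i];
    }
    return sum;
}

/* Return the first copy whose checksum matches, or NULL if none does. */
static const struct table_copy *pick_valid_copy(const struct table_copy *copies,
                                                size_t n)
{
    assert(copies != NULL);
    for (size_t i = 0; i < n; i++)
        if (copies[i].checksum == table_checksum(&copies[i]))
            return &copies[i];
    return NULL;
}
```

With a few copies spread over the disk, a scribbled primary table costs a handful of extra reads rather than a search of the whole device.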
--
The highest quality of attention we may give is love.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-03 19:34:10

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Tue, 3 Sep 2002, Hacksaw wrote:
> >The users who still need partition tables
>
> My main gripe was my impression that you wanted to do away entirely with
> partition tables, which I am now taking as a misread.
>
> I can certainly imagine a few different ways to have partition tables that
> make more sense than the typical Wintel version.

Certainly. But there are some uses where you absolutely don't want them.
There are also lots of uses where you can't do away with the tight thing
DOS gives you for i386. I'm happily using Sunish slices there.

> >Maybe divide the raid into smaller disks?!
>
> Absolutely, if that is your requirement. I have done this. It gives you
> the usefulness of smaller disks with the speed and reliability of the
> RAID.

Possibly, but you can even put together smaller arrays then (at least on a
sane controller). That's what the disk size setup is for. And that's,
after all, nothing the OS needs to take care of. It just gets to see say
16 disks of 1 TiB size each. Nothing mad.

But why have a partition table on each of these 16 pseudo disks?

> More importantly, The hardware should be considered largely immutable.

Why? There's a reason why raid is mutable.

> One reason for this: what if the controller dies?

I plug another in here and restore.

> In fact, I'd like the controller to store its RAID setup on the disk as
> well. Maybe even on all of them.

I don't see how partitioning could help you out here.

> Of course, if the partition equals the entire disk, great. The table
> will be really short.

...and pointless.

> If you've ever had 70 people waiting to be able to do any work while you
> try and restore a disk that had the partition table scribbled on, you
> appreciate what I am saying.

Well, there's absolutely no need to do that. There are tools which can
scribble together the old table, and these could even kick automagically.
(Who else thinks the BIOS shall always be replaced with something sane?)
Once your table is restored, good luck.

On the i386 boxes that we run (recently reduced to 20) I don't change too
much about the tables. I can use a perl script from a boot cd to scribble
them together again. And here I agree with partition tables, because I
mostly haven't got more than two disks -- and I need to have space for
swap, housings, workspace and the entire root.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 20:13:36

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


> Possibly, but you can even put together smaller arrays then (at least on a
> sane controller). That's what the disk size setup is for. And that's,
> after all, nothing the OS needs to take care of. It just gets to see say
> 16 disks of 1 TiB size each. Nothing mad.
>
> But why have a partition table on each of these 16 pseudo disks?

If not a partition table, then some positive indication that there is nothing
more.

> > More importantly, The hardware should be considered largely immutable.
>
> Why? There's a reason why raid is mutable.

Because changing hardware is often quite annoying, and proprietary, and hard
or impossible to script. I had one RAID that could only be set up by changing
things using four buttons on the front panel.

But more importantly, I want controllers that survive total power down. No
battery backed RAM, not for the setup. The cache can and should be, but the
set up needs to survive user stupidity as well as hardware defects. And flash
RAM still has a limited number of cycles, and can sometimes have *very*
limited cycles, despite the best efforts of the manufacturer.

> > One reason for this: what if the controller dies?
>
> I plug another in here and restore.

Restoring a 4TiB RAID is not what I want to be doing with a department of idle
developers, especially if the problem is strictly with the controller. I want
the controller to be able to get its information from the disks. That way I
plug in the new controller, and it says "Ready to go."

> > In fact, I'd like the controller to store its RAID setup on the disk as
> > well. Maybe even on all of them.
>
> I don't see how partitioning could help you out here.

It was an aside.

Another factor is usage. I've had users who were assigned a slice of a RAID
who asked me to divide the slice up, so that one experiment couldn't hose
another by filling the whole logical disk.

> > Of course, if the partition equals the entire disk, great. The table
> > will be really short.
>
> ...and pointless.

Admittedly. But it's important to indicate the lack of a partition table, so
the software can get on with life, rather than trying to determine if the
partition table got scribbled on.

>
> > If you've ever had 70 people waiting to be able to do any work while you
> > try and restore a disk that had the partition table scribbled on, you
> > appreciate what I am saying.
>
> Well, there's absolutely no need to do that. There are tools which can
> scribble together the old table, and these could even kick automagically.
> (Who else thinks the BIOS shall always be replaced with something sane?)
> Once your table is restored, good luck.

Again, searching a huge RAID to determine the old partition is not something I
want to spend an afternoon doing, not when I can spend a few blocks out of
millions and have backups.

--
We perceive our perceptions.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-03 20:29:36

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Tue, 3 Sep 2002, Hacksaw wrote:
> > But why have a partition table on each of these 16 pseudo disks?
>
> If not a partition table, then some positive indication that there is
> nothing more.

And waste a block? With correct raid organization the controller takes up
the job of your partition table. (Yes, regardless of where it's storing
data.)

> Because changing hardware is often quite annoying, and proprietary, and
> hard or impossible to script.

Anyway, that's news to me. Where?

> I had one RAID that could only be set up by changing things using four
> buttons on the front panel.

Well, I'm talking about a vision of replacing partition tables with
sensible and gentle disk handling, where possible. Old technics definitely
need some other kind of hammer, or to get replaced.

> But more importantly, I want controllers that survive total power down.

You can't get that with partition tables either. And by the way, we
succeeded doing that at Magdeburg. Pull out the power supply, batteries,
etc., then run away.

Yes, we need better materia. Mind the biotechnical approach. Still doesn't
qualify as a statement for partition tables.

> Restoring a 4TiB RAID is not what I want to be doing with a department of idle
> developers

That's why sensible raid systems do that on their own. In the meanwhile
you can get around with the data. Mostly it's nothing more than a "find
disk order", which is not too bad.

> Another factor is usage. I've had users who were assigned a slice of a
> RAID who asked me to divide the slice up, so that one experiment
> couldn't hose another by filling the whole logical disk.

Then give them two logical disks. Just a matter of management.

> Admittedly. But it's important to indicate the lack of a partition
> table, so the software can get on with life, rather than trying to
> determine if the partition table got scribbled on.

Yes, that's cool in case we'd possibly need one. But in the raid cases we
should get to a position where partition tables are not just theoretically
meaningless.

> Again, searching a huge RAID to determine the old partition is not
> something I want to spend an afternoon doing, not when I can spend a few
> blocks out of millions and have backups.

I've still not said you'd have to do that. You can have a perl script do
your job scribbling the table together.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 20:38:59

by Alan

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Tue, 2002-09-03 at 21:34, Thunder from the hill wrote:
> Well, I'm talking about a vision of replacing partition tables with
> sensible and gentle disk handling, where possible. Old technics definitely
> need some other kind of hammer, or to get replaced.

And what about all the firmware that needs PC partition tables? They
won't be going away in a hurry even as/if EFI replaces the DOS partition
format. More likely we'll get superextendedwhizzopartition types and a
hack on a hack of the original DOS ones.

> > But more importantly, I want controllers that survive total power down.
>
> You can't get that with partition tables either. And by the way, we
> succeeded doing that at Magdeburg. Pull out the power supply, batteries,
> etc., then run away.

Why not - you can journal partition updates too. There are systems out
there that do it, even ones that do cluster safe partition management on
the fly.

If you want to do partitions in user space and play with the idea the
LVM2 code is very clean, very nice and already provides you with
everything needed to do it nicely.

2002-09-03 20:42:56

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> > But more importantly, I want controllers that survive total power down.
>
> You can't get that with partition tables either. And by the way, we

WHAT? Partition tables written onto the disk certainly do survive total power
down.

>
> Then give them two logical disks. Just a matter of management.

Again, with an annoying controller, and having the user change their
requirements every so often (like once a day), I do not want to change the
RAID setup lots. The last RAID I was working with took up to an hour to commit
geometry changes to the disk.

> Yes, that's cool in case we'd possibly need one. But in the raid cases we
> should get to a position where partition tables are not just theoretically
> meaningless.

Again, I wouldn't want to depend on that, for the reasons above.

> I've still not said you'd have to do that. You can have a perl script do
> your job scribbling the table together.

Please describe this algorithm? Would this potentially mean looking at every
block on the disk, including the giant logical disk that a RAID might present?
Even if you only have to look at the first few bytes of each block, this is a
lot of seeking.
--
A decision changes the world.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-03 20:51:39

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On 3 Sep 2002, Alan Cox wrote:
> And what about all the firmware that needs PC partition tables ?

As mentioned, they should really consider going away. It's currently a bit
crappy all over, but we might learn from that.

> They won't be going away in a hurry

Not too quickly, but they will. Maybe even before the apocalypse.

> > You can't get that with partition tables either. And by the way, we
> > succeeded doing that at Magdeburg. Pull out the power supply, batteries,
> > etc., then run away.
>
> Why not - you can journal partition updates too. There are systems out
> there that do it, even ones that do cluster safe partition management on
> the fly.

That didn't help when they told us the water would come. We've put the
disks into bags and unplugged the supply.

And if you meant why not use journaled partition updates on raid -- I
still don't see how this could be any good without complicating things.
Maybe you can enlighten me?

> If you want to do partitions in user space and play with the idea the
> LVM2 code is very clean, very nice and already provides you with
> everything needed to do it nicely.

LVM2 is not the kind of thing I'd want to use on my big bad mainframe. It
may be feasible, but it doesn't have that smell. And where to plug all
those disks?

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 21:01:17

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Tue, 3 Sep 2002, Hacksaw wrote:
> > > But more importantly, I want controllers that survive total power down.
> >
> > You can't get that with partition tables either.
>
> WHAT? Partition tables written onto the disk certainly do survive total power
> down.

Disk-backed raid config storage can do that just as well. I don't really
think a partition table is the thing you'll want in order to store your
raid config.

> Again, with an annoying controller, and having the user change their
> requirements every so often (like once a day), I do not want to change
> the RAID setup lots. The last RAID I was working with took up to an hour
> to commit geometry changes to the disk.

As mentioned -- there may still be bad hardware out there. And why can't
the people just live with it? I mean you can't resize disks either. You
could even assign a number of disks of different size to the users, and if
one of them needs to get to another level of disk space, shall he. A pool
of disks can save you resizes.

> Please describe this algorithm? Would this potentially mean looking at
> every block on the disk, including the giant logical disk that a RAID
> might present? Even if you only have to look at the first few bytes of
> each block, this is a lot of seeking.

(I still wonder how often you resize your partitions, per second.)

I was talking about saving your time. And I've presented the theory of no
partition tables on raid, which reduces the whole thing to backup work on
PC, or few seeking on small disks, up to 200 GiB. Yes, currently.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 21:25:12

by Hacksaw

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

> Disk-backed raid config storage can do that just as well. I don't really
> think a partition table is the thing you'll want in order to store your
> raid config.

Don't mix the two. I want the RAID to store its configuration on disk.
Additionally, I want the OS to be able to do what it wants with the disk,
which may well mean logically dividing the disks presented by the RAID into
logical partitions.

In fact, the two systems (RAID controller and OS) should not have anything to
do with each other. The OS should just see disks, and the controller should
just see raw data for it to put where ever the OS says to put it.

> As mentioned -- there may still be bad hardware out there. And why can't
> the people just live with it? I mean you can't resize disks either.

Business requirements say you can't live with it. And yes you can resize disks,
especially if you don't mind hosing the data thereon.

> You could even assign a number of disks of different size to the users, and
> if one of them needs to get to another level of disk space, shall he. A
> pool of disks can save you resizes.

We bought disk space like it was going out of style, but don't believe for an
instant we had *any* left over. It was often assigned before we ever even got
the disks in house. And when one guy wants his 20G as one big block, and
another guy wants 5 4G slices, I have to accommodate him. And when he later
wants his 5 slices made into two big ones, I have to be able to do that.

> (I still wonder how often you resize your partitions, per second.)

Not often. I probably changed disk setup once a week. Even still, I want it to
take me almost zero time, and the RAID as a whole must remain up at all times,
24/7.

> I was talking about saving your time. And I've presented the theory of no
> partition tables on raid, which reduces the whole thing to backup work on
> PC, or few seeking on small disks, up to 200 GiB. Yes, currently.

The theory doesn't match my experience. Maybe ten years from now it will. But
I doubt it seriously. In ten years I think we'll have PC's shipping with 4TiB
disks that are still damned slow.

And my time could include every developer. At $250 an hour, 70 idle developers
= $17,500 for an hour of down time.
--
We recognise in others what we know most deeply in ourselves.
http://www.hacksaw.org -- http://www.privatecircus.com -- KB1FVD


2002-09-03 21:47:20

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Tue, 3 Sep 2002, Hacksaw wrote:
> > Disk-backed raid config storage can do that just as well. I don't
> > really think a partition table is the thing you'll want in order to
> > store your raid config.
>
> Additionally, I want the OS to be able to do what it wants with the
> disk, which may well mean logically dividing the disks presented by the
> RAID into logical partitions.

We're going cyclic. All over. I think I've made my point, and you've made
yours. Yet I could live with my raid(s). IMO you can handle it all at the
level of clueful hardware presenting virtual hardware to your OS.

> In fact, the two systems (RAID controller and OS) should not have
> anything to do with each other. The OS should just see disks, and the
> controller should just see raw data for it to put where ever the OS says
> to put it.

I agree for more than a hundred percent. And I think the OS shouldn't have
to bother with disk slicing where it's not necessary.

> And yes you can resize disks, especially if you don't mind hosing the
> data thereon.

I've meant to say "physical disks". I'm happily aware of raid virtual disk
resizing.

> > (I still wonder how often you resize your partitions, per second.)
>
> Not often. I probably changed disk setup once a week. Even still, I want it to
> take me almost zero time, and the RAID as a whole must remain up at all times,
> 24/7.

Enough time to back up the stuff.

> The theory doesn't match my experience. Maybe ten years from now it
> will. But I doubt it seriously.

I'm not talking about all the stuff that's on the market right now. I'm
talking about the good stuff which we should evolve within the next few
years in order to get proper support for our requirements.

> In ten years I think we'll have PC's shipping with 4TiB disks that are
> still damned slow.

I'm still talking about raid here.

> And my time could include every developer. At $250 an hour, 70 idle
> developers = $17,500 for an hour of down time.

My response is: don't watch the script work.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-03 21:53:48

by Alan

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Tue, 2002-09-03 at 21:55, Thunder from the hill wrote:
> And if you meant why not use journaled partition updates on raid -- I
> still don't see how this could be any good without complicating things.
> Maybe you can enlighten me?

If you have a good raid card then you can do online resizing, volume
allocation, volume format changing, volume migration etc. For those
cases you have to get the journalling right in order to be able to do
that kind of thing properly.

> LVM2 is not the kind of thing I'd want to use on my big bad mainframe. It
> may be feasible, but it doesn't have that smell. And where to plug all
> those disks?

Standard PC with 80Gb disks benefits from dynamic partitioning. But if
you are pushed then you can shove 3ware 8500 PCI cards into your slots
and get 12 SATA hotplug IDE channels per PCI slot.

That's 12 * 200Gb hotswap per pci slot. Which given 4 slots of it would
come out at a nice 9600Gb of disk. Maybe you can archive usenet on one
PC after all 8)

Alan

2002-09-03 21:58:36

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On 3 Sep 2002, Alan Cox wrote:
> If you have a good raid card then you can do online resizing, volume
> allocation, volume format changing, volume migration etc. For those
> cases you have to get the journalling right in order to be able to do
> that kind of thing properly

That's true, if you use partitions. I don't see the problem.

> Standard PC with 80Gb disks benefits from dynamic partitioning. But if
> you are pushed then you can shove 3ware 8500 PCI cards into your slots
> and get 12 SATA hotplug IDE channels per PCI slot.

Oh, well. IDE.

> Thats 12 * 200Gb hotswap per pci slot. Which given 4 slots of it would
> come out at a nice 9600Gb of disk. Maybe you can archive usenet on one
> PC after all 8)

Yes, but lately that's more of a funny solution than an enterprise one.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-04 00:05:24

by Oliver Neukum

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

On Wednesday, 4 September 2002 00:03, Thunder from the hill wrote:
> Hi,
>
> On 3 Sep 2002, Alan Cox wrote:
> > If you have a good raid card then you can do online resizing, volume
> > allocation, volume format changing, volume migration etc. For those
> > cases you have to get the journalling right in order to be able to do
> > that kind of thing properly
>
> That's true, if you use partitions. I don't see the problem.

No, it's always a problem. You need to record somewhere what you
use each disk for. If these records need to be changeable
on a live system, you need to make sure that they are always in a
consistent state.
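
One common way to keep such a record consistent while it is being changed is to alternate between two slots and let the newest valid one win. The sketch below assumes a made-up `struct alloc_record`, a toy checksum and hypothetical function names; it does not describe what any particular controller or volume manager does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDISKS 4

/* Hypothetical record of which volume each disk belongs to. */
struct alloc_record {
    uint32_t generation;      /* higher means newer */
    uint32_t owner[NDISKS];
    uint32_t checksum;        /* written last; validates the slot */
};

/* Toy checksum; the non-zero seed makes an all-zero slot invalid. */
static uint32_t record_checksum(const struct alloc_record *r)
{
    uint32_t sum = 0x9e3779b9u ^ r->generation;
    for (int i = 0; i < NDISKS; i++)
        sum = ((sum << 5) | (sum >> 27)) ^ r->owner[i];
    return sum;
}

static int record_valid(const struct alloc_record *r)
{
    return r->checksum == record_checksum(r);
}

/* Newest valid slot of the two, or NULL if neither validates. */
static const struct alloc_record *current_record(const struct alloc_record slots[2])
{
    const struct alloc_record *best = NULL;

    assert(slots != NULL);
    for (int i = 0; i < 2; i++)
        if (record_valid(&slots[i]) &&
            (best == NULL || slots[i].generation > best->generation))
            best = &slots[i];
    return best;
}

/* Write the new state into the slot that is NOT current; checksum goes last. */
static void update_record(struct alloc_record slots[2], const uint32_t owner[NDISKS])
{
    const struct alloc_record *cur = current_record(slots);
    struct alloc_record *dst = (cur == &slots[0]) ? &slots[1] : &slots[0];
    uint32_t gen = cur ? cur->generation + 1 : 1;

    dst->generation = gen;
    for (int i = 0; i < NDISKS; i++)
        dst->owner[i] = owner[i];
    dst->checksum = record_checksum(dst);
}
```

If an update is interrupted, the half-written slot fails its checksum and the previous record is still the current one.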

Regards
Oliver

2002-09-04 05:45:52

by Thunder from the hill

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c

Hi,

On Wed, 4 Sep 2002, Oliver Neukum wrote:
> No, it's always a problem. You need to record somewhere, what you use
> which disk for. If these recordings need to be changeable on a live
> system, you need to make sure that they are always in a consistent
> state.

Yes. And for a workstation system it makes sense to store that on top of
the disk. But not exactly on raid.

Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-

2002-09-04 07:57:33

by Giuliano Pochini

Subject: Re: PATCH - change to blkdev->queue calling triggers BUG in md.c


On 03-Sep-2002 Hacksaw wrote:
> I can certainly imagine a few different ways to have partition tables that
> make more sense than the typical Wintel version.

Yes, I would like to name partitions. I'd like to tell the kernel
to boot from root partition "John" and to
mount -t ext2 --part Lisa /mnt/bkup without the very annoying thing
of boot and mount failures when I add/remove physical devices. To
do this we need partition table parsing inside the kernel IMHO.
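
As a rough illustration of the lookup this implies (the `struct named_part` type and `find_by_label` are invented for this example, and `--part` above is a wished-for option, not an existing one): the name stays fixed while the device node it resolves to may change as disks come and go.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a partition's name to its current device node. */
struct named_part {
    const char *label;    /* e.g. "John" or "Lisa" */
    const char *device;   /* wherever the partition shows up today */
};

/* Return the device currently carrying the given label, or NULL. */
static const char *find_by_label(const struct named_part *parts, size_t n,
                                 const char *label)
{
    assert(label != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(parts[i].label, label) == 0)
            return parts[i].device;
    return NULL;
}
```

Building the table still requires something to read the names off the disks, which is where the question of who parses the partition table comes back in.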


Bye.