Hi,
I was looking into an issue that someone was having with raid5.
They made an md/raid5 out of 5 whole devices and by luck the data
that was written to the first block of the 5th device looked
slightly like a partition table. fdisk output below for the curious.
However some partitions were beyond the end of the device.
This resulted in annoying error messages popping up when various
programs such as mount (mounting by label) looked at various
partitions.
The messages were like:
Buffer I/O error on device sde3, logical block 1794
This logical block is way beyond the end of sde.
What is happening is that the partition is found to be beyond the end
of the device, a message is printed about this
sde: p3 exceeds device capacity
but nothing is done about it.
So IO requests in the partition can go beyond the end of the device -
and the code that is supposed to protect devices from this doesn't.
In ll_rw_blk.c, in generic_make_request, the max bi_sector is checked
against maxsector and an error is reported if bi_sector is too large.
However this only happens before blk_partition_remap is called to map
the partition-sector to the device-sector. The result of
blk_partition_remap is clearly expected to still be in range
*
* NOTE: we don't repeat the blk_size check for each new device.
* Stacking drivers are expected to know what they are doing.
Partitioning isn't exactly a 'stacking driver', but it doesn't seem
to know what it is doing :-)
So: Is there any good reason to not clip the partitions to fit
within the device - and discard those that are completely beyond
the end of the device??
The patch at the end of the mail does that. Is it OK to submit this
to mainline?
NeilBrown
# fdisk /dev/sde
Command (m for help): p
Disk /dev/sde: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sde1 1 1 995+ c7 Syrinx
Partition 1 does not end on cylinder boundary.
/dev/sde2 1 1 0 0 Empty
Partition 2 does not end on cylinder boundary.
/dev/sde3 133675 133675 995+ c7 Syrinx
Partition 3 does not end on cylinder boundary.
/dev/sde4 1 1 0 0 Empty
Partition 4 does not end on cylinder boundary.
Signed-off-by: Neil Brown <[email protected]>
### Diffstat output
./block/ioctl.c | 6 ++++++
./fs/partitions/check.c | 12 ++++++++++--
2 files changed, 16 insertions(+), 2 deletions(-)
diff .prev/block/ioctl.c ./block/ioctl.c
--- .prev/block/ioctl.c 2006-10-09 14:25:11.000000000 +1000
+++ ./block/ioctl.c 2006-10-13 09:45:52.000000000 +1000
@@ -42,6 +42,12 @@ static int blkpg_ioctl(struct block_devi
|| pstart < 0 || plength < 0)
return -EINVAL;
}
+ /* Does partition fit within device */
+ if (start + length > get_capacity(disk)) {
+ if (start >= get_capacity(disk))
+ return -EINVAL;
+ length = get_capacity(disk) - start;
+ }
/* partition number in use? */
mutex_lock(&bdev->bd_mutex);
if (disk->part[part - 1]) {
diff .prev/fs/partitions/check.c ./fs/partitions/check.c
--- .prev/fs/partitions/check.c 2006-10-13 09:28:21.000000000 +1000
+++ ./fs/partitions/check.c 2006-10-13 09:46:12.000000000 +1000
@@ -466,8 +466,16 @@ int rescan_partitions(struct gendisk *di
if (!size)
continue;
if (from + size > get_capacity(disk)) {
- printk(" %s: p%d exceeds device capacity\n",
- disk->disk_name, p);
+ if (from >= get_capacity(disk)) {
+ printk(KERN_INFO
+ " %s: p%d starts beyond end of device\n",
+ disk->disk_name, p);
+ continue;
+ }
+ printk(KERN_INFO
+ " %s: p%d exceeds device capacity - clipping\n",
+ disk->disk_name, p);
+ size = get_capacity(disk) - from;
}
add_partition(disk, p, from, size);
#ifdef CONFIG_BLK_DEV_MD
On Fri, Oct 13, 2006 at 09:50:49AM +1000, Neil Brown wrote:
> So: Is there any good reason to not clip the partitions to fit
> within the device - and discard those that are completely beyond
> the end of the device??
Almost precisely this issue came up recently.
If I recall correctly at that time the idea was to discard
partitions that do not fit on the known disk. A bad idea.
Your idea is better, namely to clip partitions.
Still, there are a few reasons why one should be careful.
One is the existence of clipped disks. There are various ways of
making a disk appear smaller than it really is - there may be
a HPA or DCO or so, or just a capacity-limiting jumper.
This may mean that the kernel does not really know the size
of the disk. The jumper may cause IDENTIFY to return a small size
while actual I/O succeeds beyond that. Or, a SETMAX command is
needed to make all of a partition available. Etc.
One is the numbering of partitions. People are very unhappy
when something causes their partitions to be renumbered.
That is an argument against the discarding.
In the forensics situation you want to take a copy of a disk.
But often that is impractical - copying this 500GB disk takes too long,
or the scratch disk is not large enough, and the copy only holds the
initial part of a disk.
You do not want to discard such partial partitions - maybe clipping is OK,
although I would prefer to see precisely the same data on the copy as on
the original, except of course that actually accessing nonexistent data
returns an I/O error, but discarding would again cause renumbering. Bad.
[As an aside: for the past twelve years or so I have muttered once a year
that it is bad that Linux does automatic probing for partitions.
It will be mistaken every now and then.
With some partition types there is a fairly large probability
that random data is seen as a partition table.
A correct system does not guess (unless asked to guess by the user).
A correct system is set up in such a way that the boot parameters tell it
1. the root disk, 2. the partition type of the root disk,
3. the root partition, 4. the filesystem type of the root filesystem.
Now the root disk can contain configuration data that causes the system
to look at specified disks in specified ways, or to do default things.
With a system that was set up correctly, your nonsense partitions
would never have been found by the kernel, and I suppose mount by label
would not have encountered any problems.]
Andries
On Friday October 13, [email protected] wrote:
>
> One is the existence of clipped disks. There are various ways of
> making a disk appear smaller than it really is - there may be
> a HPA or DCO or so, or just a capacity-limiting jumper.
> This may mean that the kernel does not really know the size
> of the disk. The jumper may cause IDENTIFY to return a small size
> while actual I/O succeeds beyond that. Or, a SETMAX command is
> needed to make all of a partition available. Etc.
Yes..... but access to the whole device is clipped based on the
kernel's understanding of the size of the device, and allowing code
too by-pass that clipping by using a partition seems just *wrong*.
Presumably a very-privileged BLKSETSIZE64 would be the appropriate
thing for that sort of situation.
If a device claims to be smaller will the low-level driver actually
allow a request to a larger address at all???
>
> One is the numbering of partitions. People are very unhappy
> when something causes their partitions to be renumbered.
> That is an argument against the discarding.
Valid comment. My patch doesn't affect numbering.
Partitions that are beyond the end of the device are not enabled at
all, but that doesn't affect the numbering of later partitions.
>
> In the forensics situation you want to take a copy of a disk.
> But often that is impractical - copying this 500GB disk takes too long,
> or the scratch disk is not large enough, and the copy only holds the
> initial part of a disk.
> You do not want to discard such partial partitions - maybe clipping is OK,
> although I would prefer to see precisely the same data on the copy as on
> the original, except of course that actually accessing nonexistent data
> returns an I/O error, but discarding would again cause renumbering. Bad.
So your main points seem to be:
- removing access that people currently have is bad
- causing partitions to be renumbered is bad.
I certainly agree. Unless people actually do use partitions to bypass
the device-size check in generic_make_request, this patch doesn't do
either these bad things.
The only visible effect should be:
- partitions that have no accessible blocks will no longer be able to
be opened.
- partitions that start in-range and end beyond range will appear
smaller than they would have done.
- reading the last block of a partition will no longer cause error
messages that look like there are device errors.
All of these seem to be acceptable effects. Do you agree?
>
> [As an aside: for the past twelve years or so I have muttered once a year
> that it is bad that Linux does automatic probing for partitions.
> It will be mistaken every now and then.
> With some partition types there is a fairly large probability
> that random data is seen as a partition table.
> A correct system does not guess (unless asked to guess by the user).
> A correct system is set up in such a way that the boot parameters tell it
> 1. the root disk, 2. the partition type of the root disk,
> 3. the root partition, 4. the filesystem type of the root filesystem.
> Now the root disk can contain configuration data that causes the system
> to look at specified disks in specified ways, or to do default things.
>
> With a system that was set up correctly, your nonsense partitions
> would never have been found by the kernel, and I suppose mount by label
> would not have encountered any problems.]
I have substantial sympathy with your aside. But people don't want to
have to configure things. They just want it to "work". I get this
with md/raid all the time. "Just find all the arrays and assemble
them". Like with partitions, that isn't always a well defined
operation. But most of the time it is, and that is what people think
they want.
Ho hum.
Thanks for your input.
NeilBrown
Ar Gwe, 2006-10-13 am 09:50 +1000, ysgrifennodd Neil Brown:
> So: Is there any good reason to not clip the partitions to fit
> within the device - and discard those that are completely beyond
> the end of the device??
Its close but not quite the right approach
> The patch at the end of the mail does that. Is it OK to submit this
> to mainline?
No I think not. Any partition which is partly outside the disk should be
ignored entirely, that ensures it doesn't accidentally get mounted and
trashed by an HPA or similar mixup.
I agree it should be fixed, I just don't think your fix is actually a
real fix in this case.
From: "Alan Cox" <[email protected]>
> Ar Gwe, 2006-10-13 am 09:50 +1000, ysgrifennodd Neil Brown:
>> So: Is there any good reason to not clip the partitions to fit
>> within the device - and discard those that are completely beyond
>> the end of the device??
>
> Its close but not quite the right approach
>
>> The patch at the end of the mail does that. Is it OK to submit this
>> to mainline?
>
> No I think not. Any partition which is partly outside the disk should be
> ignored entirely, that ensures it doesn't accidentally get mounted and
> trashed by an HPA or similar mixup.
This is also a risk for users who have a partition that was poorly
setup with an earlier version that allowed it to almost work. If there
is data on that partition refusing to mount it can lead to massive
data loss akin to a broken disk drive.
<<jdow>> I'd propose allowing it to mount read-only and forcing
read-only mode when attempts to mount read-write are made. That way
the user never perceives a data loss situation and can take the
appropriate steps to repair the partitioning error.
{^_^} Joanne Dow
On Fri, Oct 13, 2006 at 09:50:49AM +1000, you [Neil Brown] wrote:
>
> Hi,
> I was looking into an issue that someone was having with raid5.
> They made an md/raid5 out of 5 whole devices and by luck the data
> that was written to the first block of the 5th device looked
> slightly like a partition table. fdisk output below for the curious.
> However some partitions were beyond the end of the device.
That reminds me of an old long-standing mystery I had with a machine that
had a RAID-5 of three whole devices.
I wonder if there's ever a change the kernel partition detection code could
_write_ on the disk, even when there's really no partition table?
Below is a description of the problem. I'm afraid I only replicated in on
2.2 and 2.4, but it just might still be present in 2.6. Unfortunately, the
actual raid device is now on different disks that have partition table on
them. I'm just asking if this rings any bells, since I really spent a long
time debugging it, and never found a real clue.
------------------------------------------------------------
The raid device is:
md4 : active raid5 hdc[2] hdb[1] hdg[0] 156367872 blocks level 5, 16k chunk, algorithm 0 [3/3] [UUU]
The kernel is 2.2.x + RAID-0.90 patch. Fs is ext2. After unmounting the
filesystem, I can mount it again without problems. I can also raidstop the
raid device in between and all is still fine:
-> umount /dev/md4; mount /dev/md4
- no corruption
-> umount /dev/md4; raidstop /dev/md4; raidstart /dev/md4; mount /dev/md4
- no corruption
But after a reboot, the filesystem is corrupted - few bytes differ in the
beginning of /dev/md4 between 1k and and 5k.
cmp -l md4 afterboot/md4
1083 1 3
4641 35 0
4642 205 0
4643 10 0
bytepos after before
boot boot
I found out that the difference (corruption) is usually on three bytes on
/dev/hdg, but sometimes on /dev/hdc, too. (/dev/md4 = hdb+hdc+hdg; hdb&hdc
are on i810, hdg is on hpt370).
First, I did
umount /dev/md4
raidstop /dev/md4
head -c 50k /dev/hdg > /save/hdg
reboot
To rule out kernel raid autodetect and raid code in general, I
booted 2.2.25 with "single init=/bin/bash raid=noautodetect".
head -c50k /dev/hdg | cmp -l /save/hdg
Three bytes differed:
4641 0 35
4642 0 205
4643 0 10
bytepos after before
boot boot
wrote the original stuff back:
dd if=/save/hdg /dev/hdg
sync
hdparm -W0 /dev/hdg
sync
reboot
Booted 2.2.25 with "single init=/bin/bash raid=noautodetect" again.
Did
head -c50k /dev/hdg | cmp -l /save/hdg
Three same three bytes differed again.
Wrote the stuff back, synced, did hdparm, and powered off. Still, the the
bytes differed on next boot.
Then I booted 2.4.21 with "single init=/bin/bash raid=noautodetect" (I
happened to have 2.4.21 compiled with suitable drivers at hand). Wrote the
same stuff back with dd, synced, turned ide cache off.
Booted 2.4.21 with "single init=/bin/bash raid=noautodetect" again. Did the
diff; the three bytes differed again.
Note that sometimes few bytes on hdc differed, too. Usually it was just the
three hdg bytes.
So this is not a 2.2 kernel issue. I suspect it might not be a kernel issue
at all. Unless it is a bug in kernel partition detection that is still
present in 2.4.x, perhaps in 2.6.
I tried to turn off the ide write cache with hdparm -W0, so it
shouldn't be a write caching issue.
If it's a bios issue, it's really a strange one, since it affects
both disks on i810 ide and on hpt370. The disks have no partition table,
though, which _could_ confuse the bios.
Any ideas? Who could write to those three bytes, and why?
On Sunday October 15, [email protected] wrote:
> On Fri, Oct 13, 2006 at 09:50:49AM +1000, you [Neil Brown] wrote:
> >
> > Hi,
> > I was looking into an issue that someone was having with raid5.
> > They made an md/raid5 out of 5 whole devices and by luck the data
> > that was written to the first block of the 5th device looked
> > slightly like a partition table. fdisk output below for the curious.
> > However some partitions were beyond the end of the device.
>
> That reminds me of an old long-standing mystery I had with a machine that
> had a RAID-5 of three whole devices.
>
> I wonder if there's ever a change the kernel partition detection code could
> _write_ on the disk, even when there's really no partition table?
No, kernel partition detection never writes.
I can't imagine what could have been causing your very-interesting
corruption. However as it was only demonstrated on old code and
cannot be explored further, I suspect we just have to forget it and
move one :-(
NeilBrown
On Friday October 13, [email protected] wrote:
> Ar Gwe, 2006-10-13 am 09:50 +1000, ysgrifennodd Neil Brown:
> > So: Is there any good reason to not clip the partitions to fit
> > within the device - and discard those that are completely beyond
> > the end of the device??
>
> Its close but not quite the right approach
>
> > The patch at the end of the mail does that. Is it OK to submit this
> > to mainline?
>
> No I think not. Any partition which is partly outside the disk should be
> ignored entirely, that ensures it doesn't accidentally get mounted and
> trashed by an HPA or similar mixup.
Hmmm.. So Alan things a partially-outside-this-disk partition
shouldn't show up at all, and Andries thinks it should.
And both give reasonably believable justifications.
Maybe we need a kernel parameter? How about this?
NeilBrown
-----------------------------
Don't allow partitions to start or end beyond the end of the device.
Corrupt partitions tables can cause wierd partitions that confuse
programs. This is confusion that can be avoided.
If a partition appears to start at or beyond the end of a device, we
don't enable it.
If it starts within the device but ends after the end, we clip it to
fit within the device.
Not enabling partitions does not affect partition numbering of
subsequent partitions.
This change applies to partitions found by fs/partitions/check.c
and to partitions explicitly created via an ioctl.
There is no uniform agreement on whether partitions that extend
beyond the end of the device should be clipped or discarded.
Discarding is safer as it makes corruption less likely.
Clipping is more flexable and gives continued access to the partition.
So provide a kernel-parameter which a 'safe' default.
partitions=strict
is the default
partitions=relaxed
means that partitions are clipped rather than rejected.
This kernel parameters only applies to auto-detected partitions,
not those set by ioctl.
Signed-off-by: Neil Brown <[email protected]>
### Diffstat output
./Documentation/kernel-parameters.txt | 8 ++++++++
./block/ioctl.c | 6 ++++++
./fs/partitions/check.c | 28 ++++++++++++++++++++++++++--
3 files changed, 40 insertions(+), 2 deletions(-)
diff .prev/Documentation/kernel-parameters.txt ./Documentation/kernel-parameters.txt
--- .prev/Documentation/kernel-parameters.txt 2006-10-16 10:03:40.000000000 +1000
+++ ./Documentation/kernel-parameters.txt 2006-10-16 10:06:42.000000000 +1000
@@ -1148,6 +1148,14 @@ and is between 256 and 4096 characters.
Currently this function knows 686a and 8231 chips.
Format: [spp|ps2|epp|ecp|ecpepp]
+ partitions= How to interpret partition information that
+ could be corrupt.
+ 'strict' is the default. Partitions that
+ don't fit in the device are rejected.
+ 'relaxed' is an option. Partitions that start
+ within the device be end beyond the end are
+ clipped.
+
pas2= [HW,OSS] Format:
<io>,<irq>,<dma>,<dma16>,<sb_io>,<sb_irq>,<sb_dma>,<sb_dma16>
diff .prev/block/ioctl.c ./block/ioctl.c
--- .prev/block/ioctl.c 2006-10-13 14:30:56.000000000 +1000
+++ ./block/ioctl.c 2006-10-16 09:54:40.000000000 +1000
@@ -42,6 +42,12 @@ static int blkpg_ioctl(struct block_devi
|| pstart < 0 || plength < 0)
return -EINVAL;
}
+ /* Does partition fit within device */
+ if (start + length > get_capacity(disk)) {
+ if (start >= get_capacity(disk))
+ return -EINVAL;
+ length = get_capacity(disk) - start;
+ }
/* partition number in use? */
mutex_lock(&bdev->bd_mutex);
if (disk->part[part - 1]) {
diff .prev/fs/partitions/check.c ./fs/partitions/check.c
--- .prev/fs/partitions/check.c 2006-10-13 14:30:56.000000000 +1000
+++ ./fs/partitions/check.c 2006-10-16 10:07:53.000000000 +1000
@@ -443,6 +443,8 @@ exit:
}
}
+static int strict_partitions = 1;
+
int rescan_partitions(struct gendisk *disk, struct block_device *bdev)
{
struct parsed_partitions *state;
@@ -466,8 +468,22 @@ int rescan_partitions(struct gendisk *di
if (!size)
continue;
if (from + size > get_capacity(disk)) {
- printk(" %s: p%d exceeds device capacity\n",
- disk->disk_name, p);
+ if (strict_partitions) {
+ printk(KERN_INFO
+ " %s: p%d ends beyond end of device"
+ " - ignoring\n", disk->disk_name, p);
+ continue;
+ }
+ if (from >= get_capacity(disk)) {
+ printk(KERN_INFO
+ " %s: p%d starts beyond end of device"
+ " - ignoring\n", disk->disk_name, p);
+ continue;
+ }
+ printk(KERN_INFO
+ " %s: p%d exceeds device capacity - clipping\n",
+ disk->disk_name, p);
+ size = get_capacity(disk) - from;
}
add_partition(disk, p, from, size);
#ifdef CONFIG_BLK_DEV_MD
@@ -479,6 +495,14 @@ int rescan_partitions(struct gendisk *di
return 0;
}
+static int __init part_setup(char *str)
+{
+ strict_partitions = strcmp(str, "relaxed");
+ return 1;
+}
+
+__setup("partitions=", part_setup);
+
unsigned char *read_dev_sector(struct block_device *bdev, sector_t n, Sector *p)
{
struct address_space *mapping = bdev->bd_inode->i_mapping;
On Mon, 16 Oct 2006 10:08:50 +1000 Neil Brown wrote:
> On Friday October 13, [email protected] wrote:
> > Ar Gwe, 2006-10-13 am 09:50 +1000, ysgrifennodd Neil Brown:
> > > So: Is there any good reason to not clip the partitions to fit
> > > within the device - and discard those that are completely beyond
> > > the end of the device??
> >
> > Its close but not quite the right approach
> >
> > > The patch at the end of the mail does that. Is it OK to submit this
> > > to mainline?
> >
> > No I think not. Any partition which is partly outside the disk should be
> > ignored entirely, that ensures it doesn't accidentally get mounted and
> > trashed by an HPA or similar mixup.
>
> Hmmm.. So Alan things a partially-outside-this-disk partition
> shouldn't show up at all, and Andries thinks it should.
> And both give reasonably believable justifications.
>
> Maybe we need a kernel parameter? How about this?
>
> NeilBrown
>
>
> -----------------------------
> Don't allow partitions to start or end beyond the end of the device.
>
> Corrupt partitions tables can cause wierd partitions that confuse
> programs. This is confusion that can be avoided.
>
> If a partition appears to start at or beyond the end of a device, we
> don't enable it.
> If it starts within the device but ends after the end, we clip it to
> fit within the device.
>
> Not enabling partitions does not affect partition numbering of
> subsequent partitions.
>
> This change applies to partitions found by fs/partitions/check.c
> and to partitions explicitly created via an ioctl.
>
> There is no uniform agreement on whether partitions that extend
> beyond the end of the device should be clipped or discarded.
> Discarding is safer as it makes corruption less likely.
> Clipping is more flexable and gives continued access to the partition.
> So provide a kernel-parameter which a 'safe' default.
>
> partitions=strict
> is the default
> partitions=relaxed
> means that partitions are clipped rather than rejected.
> This kernel parameters only applies to auto-detected partitions,
> not those set by ioctl.
>
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./Documentation/kernel-parameters.txt | 8 ++++++++
> ./block/ioctl.c | 6 ++++++
> ./fs/partitions/check.c | 28 ++++++++++++++++++++++++++--
> 3 files changed, 40 insertions(+), 2 deletions(-)
>
> diff .prev/Documentation/kernel-parameters.txt ./Documentation/kernel-parameters.txt
> --- .prev/Documentation/kernel-parameters.txt 2006-10-16 10:03:40.000000000 +1000
> +++ ./Documentation/kernel-parameters.txt 2006-10-16 10:06:42.000000000 +1000
> @@ -1148,6 +1148,14 @@ and is between 256 and 4096 characters.
> Currently this function knows 686a and 8231 chips.
> Format: [spp|ps2|epp|ecp|ecpepp]
>
> + partitions= How to interpret partition information that
> + could be corrupt.
> + 'strict' is the default. Partitions that
> + don't fit in the device are rejected.
> + 'relaxed' is an option. Partitions that start
> + within the device be end beyond the end are
s/be/but/ ??
> + clipped.
> +
---
~Randy
Neil Brown wrote:
> On Friday October 13, [email protected] wrote:
> > No I think not. Any partition which is partly outside the disk should be
> > ignored entirely, that ensures it doesn't accidentally get mounted and
> > trashed by an HPA or similar mixup.
>
> Hmmm.. So Alan things a partially-outside-this-disk partition
> shouldn't show up at all, and Andries thinks it should.
> And both give reasonably believable justifications.
>
> Maybe we need a kernel parameter? How about this?
How about something in /proc or /sys to force the partition to show or not?
Maybe something like
echo 1 > /sys/block/hda/hda4/visible
?
Just an idea, my view would be not to introduce a new kernel parameter since
you have to reboot just to see it. It's quite possible this would effect
USB/Firewire storage devices. I have never been bitten by this problem
however it would be nice not to have to reboot to solve one like this say on
a slightly corrupted usb memory stick.
A possibility of using words instead of numbers, in the above example could
be one of: yes r/o no indicating full r/w, r/o, or not visible.
I'm not so sure if the size of the partition that is visible should be the
size according to the partition table or the size according to the start of
the partition to the end of the device. I would probably choose to have the
size to be the smaller number.
I was bitten by a 2.4 to 2.6 conversion where the size of the disk was 1
sector larger than what I saw in 2.4, but it was a minor problem which was
solved by masking off bit 0 of the size.
--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???
On Oct 15, 2006, at 20:08:50, Neil Brown wrote:
> Hmmm.. So Alan things a partially-outside-this-disk partition
> shouldn't show up at all, and Andries thinks it should.
> And both give reasonably believable justifications.
>
> Maybe we need a kernel parameter? How about this?
>
> [...snip...]
> So provide a kernel-parameter which a 'safe' default.
>
> partitions=strict
> is the default
> partitions=relaxed
> means that partitions are clipped rather than rejected.
> This kernel parameters only applies to auto-detected partitions,
> not those set by ioctl.
Perhaps it should also support partitions=none; so that those of us
who want to use a small userspace program and device-mapper to do the
partition discovery and access may do so without worrying about how
the kernel perceives the partitions (if at all). It also makes it
fairly trivial to add support for entirely new parition types without
having to modify the kernel at all.
Cheers,
Kyle Moffett
On Mon, Oct 16, 2006 at 09:50:15AM +1000, you [Neil Brown] wrote:
> On Sunday October 15, [email protected] wrote:
> > On Fri, Oct 13, 2006 at 09:50:49AM +1000, you [Neil Brown] wrote:
> > >
> > > Hi,
> > > I was looking into an issue that someone was having with raid5.
> > > They made an md/raid5 out of 5 whole devices and by luck the data
> > > that was written to the first block of the 5th device looked
> > > slightly like a partition table. fdisk output below for the curious.
> > > However some partitions were beyond the end of the device.
> >
> > That reminds me of an old long-standing mystery I had with a machine that
> > had a RAID-5 of three whole devices.
> >
> > I wonder if there's ever a change the kernel partition detection code could
> > _write_ on the disk, even when there's really no partition table?
>
> No, kernel partition detection never writes.
Well yeah, that sounded very improbable to begin with. But then again, there
were very few things that could've done it apart from kernel early boot/late
shutdown code, bios and cosmic radiation.
> I can't imagine what could have been causing your very-interesting
> corruption. However as it was only demonstrated on old code and
> cannot be explored further, I suspect we just have to forget it and
> move one :-(
Yes, that's certainly the most sensible thing to do.
Although it did happen with both 2.2 and 2.4 (never tried 2.6
unfortunately), but only on a single box, and I never heard of anybody else
seeing anything similar, so it can't be a wide spread problem even it still
were present.
Thanks for the reply.
-- v --
[email protected]
On Mon, Oct 16, 2006 at 09:50:15AM +1000, Neil Brown wrote:
> On Sunday October 15, [email protected] wrote:
> > I wonder if there's ever a change the kernel partition detection code could
> > _write_ on the disk, even when there's really no partition table?
>
> No, kernel partition detection never writes.
There is something else that writes, however, that I have gotten complaints about.
(But I have not investigated.)
People doing forensics take a copy of a disk and want to preserve
that copy as-is, never changing a single bit, only looking at it.
But it is reported that also when a partition is mounted read-only,
the journaling code of ext3 will write to the journal.
Andries
On Mon, Oct 16, 2006 at 08:02:27AM +0200, you [Andries Brouwer] wrote:
> On Mon, Oct 16, 2006 at 09:50:15AM +1000, Neil Brown wrote:
> > On Sunday October 15, [email protected] wrote:
>
> > > I wonder if there's ever a change the kernel partition detection code could
> > > _write_ on the disk, even when there's really no partition table?
> >
> > No, kernel partition detection never writes.
>
> There is something else that writes, however, that I have gotten complaints about.
> (But I have not investigated.)
> People doing forensics take a copy of a disk and want to preserve
> that copy as-is, never changing a single bit, only looking at it.
> But it is reported that also when a partition is mounted read-only,
> the journaling code of ext3 will write to the journal.
Yes, that's (sort of) known. As in "mentioned on lkml" - perhaps even in
some documentation.
In this case, the fs was ext2, though, and the problem occurred even when
the fs was never mounted and the raid device was never started.
-- v --
[email protected]
On Mon, 16 Oct 2006, Andries Brouwer wrote:
> On Mon, Oct 16, 2006 at 09:50:15AM +1000, Neil Brown wrote:
> > On Sunday October 15, [email protected] wrote:
>
> > > I wonder if there's ever a change the kernel partition detection code could
> > > _write_ on the disk, even when there's really no partition table?
> >
> > No, kernel partition detection never writes.
>
> There is something else that writes, however, that I have gotten complaints about.
> (But I have not investigated.)
> People doing forensics take a copy of a disk and want to preserve
> that copy as-is, never changing a single bit, only looking at it.
> But it is reported that also when a partition is mounted read-only,
> the journaling code of ext3 will write to the journal.
xfs has a similar problem -- on umount it writes even if the filesystem
was mounted read-only. for example here's a test with a small loopback
device, note the two WRITEs to loop0 near the bottom of the block_dump.
# echo 1 >/proc/sys/vm/block_dump; mount -o ro /dev/loop/0 /mnt; sleep 1; umount /mnt; sleep 1; echo 0 >/proc/sys/vm/block_dump
<7>mount(14602): READ block 0 on loop0
<7>mount(14602): READ block 8 on loop0
<7>mount(14602): READ block 16 on loop0
<7>mount(14602): READ block 24 on loop0
<7>mount(14602): READ block 32 on loop0
<7>mount(14602): READ block 40 on loop0
<7>mount(14602): READ block 48 on loop0
<7>mount(14602): READ block 56 on loop0
<7>mount(14602): READ block 64 on loop0
<7>mount(14602): READ block 72 on loop0
<7>mount(14602): READ block 80 on loop0
<7>mount(14602): READ block 88 on loop0
<7>mount(14602): READ block 96 on loop0
<7>mount(14602): READ block 104 on loop0
<7>mount(14602): READ block 112 on loop0
<7>mount(14602): READ block 120 on loop0
<7>mount(14602): READ block 128 on loop0
<7>mount(14602): READ block 136 on loop0
<7>mount(14602): READ block 144 on loop0
<7>mount(14602): READ block 152 on loop0
<7>mount(14602): READ block 160 on loop0
<7>mount(14602): READ block 168 on loop0
<7>mount(14602): READ block 176 on loop0
<7>mount(14602): READ block 184 on loop0
<7>mount(14602): READ block 192 on loop0
<7>mount(14602): READ block 200 on loop0
<7>mount(14602): READ block 208 on loop0
<7>mount(14602): READ block 216 on loop0
<7>mount(14602): READ block 224 on loop0
<7>mount(14602): READ block 232 on loop0
<7>mount(14602): READ block 240 on loop0
<7>mount(14602): READ block 248 on loop0
<7>mount(14602): READ block 0 on loop0
<7>mount(14602): READ block 524280 on loop0
<5>XFS mounting filesystem loop0
<7>mount(14602): READ block 262176 on loop0
<7>mount(14602): READ block 271775 on loop0
<7>mount(14602): READ block 266975 on loop0
<7>mount(14602): READ block 264575 on loop0
<7>mount(14602): READ block 263375 on loop0
<7>mount(14602): READ block 262775 on loop0
<7>mount(14602): READ block 262475 on loop0
<7>mount(14602): READ block 262325 on loop0
<7>mount(14602): READ block 262250 on loop0
<7>mount(14602): READ block 262213 on loop0
<7>mount(14602): READ block 262194 on loop0
<7>mount(14602): READ block 262185 on loop0
<7>mount(14602): READ block 262180 on loop0
<7>mount(14602): READ block 262178 on loop0
<7>mount(14602): READ block 262179 on loop0
<7>mount(14602): READ block 262176 on loop0
<7>mount(14602): READ block 262176 on loop0
<7>mount(14602): READ block 262179 on loop0
<7>mount(14602): READ block 262178 on loop0
<7>mount(14602): READ block 262179 on loop0
<7>mount(14602): READ block 64 on loop0
<7>Ending clean XFS mount for filesystem: loop0
<7>mount(14602): dirtied inode 8880357 (blkid.tab-rOdAVQ) on md3
<7>mount(14602): dirtied inode 8880459 (?) on md3
<7>md3_raid1(821): WRITE block 147539744 on sdb4
<7>md3_raid1(821): WRITE block 147539744 on sda4
<7>md3_raid1(821): WRITE block 147539760 on sdb4
<7>md3_raid1(821): WRITE block 147539760 on sda4
<7>md3_raid1(821): WRITE block 147539768 on sdb4
<7>md3_raid1(821): WRITE block 147539768 on sda4
<7>md3_raid1(821): WRITE block 147539720 on sdb4
<7>md3_raid1(821): WRITE block 147539720 on sda4
<7>md3_raid1(821): WRITE block 147539784 on sdb4
<7>md3_raid1(821): WRITE block 147539784 on sda4
<7>umount(14607): WRITE block 0 on loop0
<7>loop0(14289): dirtied inode 19 (test) on md3
<7>umount(14607): WRITE block 0 on loop0
<7>umount(14607): dirtied inode 8880877 (mtab.tmp) on md3
<7>zsh(4125): dirtied inode 4026531942 (block_dump) on proc
-dean
On Mon, Oct 16, 2006 at 10:08:50AM +1000, Neil Brown wrote:
> Hmmm.. So Alan thinks a partially-outside-this-disk partition
> shouldn't show up at all, and Andries thinks it should.
> And both give reasonably believable justifications.
>
> Maybe we need a kernel parameter?
If you introduce a kernel parameter, let it be one that tells the kernel
to stay away from partitions, so that partitions can be added later by
the partx ioctls.
Now setting up an initrd or so is kind of tricky, not something ordinary users
would want to do, so if it moreover is possible to specify the rootpttype
that would allow an ordinary boot and leave all other block devices untouched.
> Not enabling partitions does not affect partition numbering of
> subsequent partitions.
A funny effect might be that hda5 exists, hda6 does not, and hda7 exists again.
Maybe unexpected for some software.
Andries
On Mon, 16 Oct 2006, Andries Brouwer wrote:
> A funny effect might be that hda5 exists, hda6 does not, and hda7 exists again.
> Maybe unexpected for some software.
why would that be unexpected? that seems entirely normal these days with
udev...
# ls /dev/hda?
/dev/hda1 /dev/hda2 /dev/hda5
-dean
On Mon, 2006-10-16 at 10:08 +1000, Neil Brown wrote:
> On Friday October 13, [email protected] wrote:
> > Ar Gwe, 2006-10-13 am 09:50 +1000, ysgrifennodd Neil Brown:
> > > So: Is there any good reason to not clip the partitions to fit
> > > within the device - and discard those that are completely beyond
> > > the end of the device??
> >
> > Its close but not quite the right approach
> >
> > > The patch at the end of the mail does that. Is it OK to submit this
> > > to mainline?
> >
> > No I think not. Any partition which is partly outside the disk should be
> > ignored entirely, that ensures it doesn't accidentally get mounted and
> > trashed by an HPA or similar mixup.
>
> Hmmm.. So Alan things a partially-outside-this-disk partition
> shouldn't show up at all, and Andries thinks it should.
> And both give reasonably believable justifications.
Maybe the whole part table should be marked as "weird" to let userspace
run a diagnostics/repair tool on the disk.
Xav
Ar Llu, 2006-10-16 am 09:28 +0200, ysgrifennodd Xavier Bestel:
> > Hmmm.. So Alan things a partially-outside-this-disk partition
> > shouldn't show up at all, and Andries thinks it should.
> > And both give reasonably believable justifications.
>
> Maybe the whole part table should be marked as "weird" to let userspace
> run a diagnostics/repair tool on the disk.
I actually like the "read only" suggestion that was made. Allow data
access but protect from damage.
Both options will allow 'repair' of a broken partition table as the
partition will show up in fdisk which accesses stuff directly, and when
that causes a revalidate it will re-appear to the kernel.
Incidentally the question of exactly what libata should do about HPA
handling also needs sorting out.
Alan
On Mon, Oct 16, 2006 at 11:54:32AM +0100, Alan Cox wrote:
> Incidentally the question of exactly what libata should do about HPA
> handling also needs sorting out.
HPA isn't even uncommon, lots of laptops use it for recovery purposes
it seems.