From: TJ <[email protected]>
Applies to: up-to and including 2.6.20-rc7
This rare but critical bug has the potential to cause a hardware failure on disk drives by
allowing the system to repeatedly attempt to seek to sectors beyond the end of the physical
disk, causing sustained 'head banging'.
The bug particularly affects dmraid-managed RAID 1 stripes of the type hde+hdf where the
first physical disk hde contains a standard partition table which relates to the larger
logical disk represented by hde+hdf.
The essence is that probing of physical disks that are part of a larger logical disk
should be prevented because those disks will be managed by a driver that loads later in the
boot sequence.
This patch doesn't prevent probing of disks with 'sane' partition table entries.
At boot-time when drives are being probed the disks are scanned for partition tables by
fs/partitions/check.c:check_partition() which makes calls to all registered partition-types.
In the case of the commonly used "msdos" partition-type used for Linux, BSD, Solaris, MS-DOS,
extended and others, the checking is done in
fs/partitions/msdos.c:msdos_partition().
The partition table is only checked for validity based on the 'magic bytes' 55AA in the boot
sector. The sector values in the partition table are copied without any checks to ensure they
are within the bounds of the disk device.
As a result, block devices are created based on the partition structures and then various
file-systems are given the task of scanning the partition to determine if it is one they will
manage.
This scanning, in a partition that has sector numbers outside the bounds of the device, causes
the errors.
Signed-off-by: TJ <[email protected]>
---
I'm not sure if this bug will affect mdraid RAID-1 stripes, or other software
RAID configurations.
The bug was discovered on a RAID 1+0 array consisting of 4x60GB drives on a
Promise FastTrak PDC20271 2-channel IDE controller (hde+hdf mirrored to hdg+hdh)
with logical block addressing (LBA).
There are 3 prolonged periods of disk-probing each lasting about 20 seconds
during which the 'head banging' is quite scary. The first two occur during the
kernel boot, and the last will occur when a GUI environment such as Gnome
initialises.
In the system where this bug appeared this caused thousands of disk-read errors
during boot (which overflowed dmesg log), and 'head bangs' the drive(s) so hard
that sometimes the system has to be powered off for a considerable time before
the disk(s) will re-initialise.
--- fs/partitions/msdos.c 2006-11-29 21:57:37.000000000 +0000
+++ fs/partitions/msdos.tj.c 2007-01-31 23:39:09.000000000 +0000
@@ -399,6 +399,79 @@ static struct {
{NEW_SOLARIS_X86_PARTITION, parse_solaris_x86},
{0, NULL},
};
+
+/*
+ * Check that *all* sector offsets are valid before actually building the partition structure.
+ *
+ * This prevents physical damage to disks and boot-time problems caused by an apparently valid
+ * partition table causing attempts to read sectors beyond the end of the physical disk.
+ *
+ * This is especially important where this is the first physical disk in a striped RAID array
+ * and the partition table contains sector offsets into the larger logical disk (beyond the end
+ * of this physical disk).
+ *
+ * The RAID module will correctly manage the disks.
+ *
+ * The function is re-entrant so it can call itself to check extended partitions.
+ *
+ * @param p partition table
+ * @param bdev block device
+ * @returns -1 if insane values found; 0 otherwise
+ * @copy Copyright 31 January 2007
+ * @author TJ <[email protected]>
+ */
+int check_sane_values(struct partition *p, struct block_device *bdev) {
+ unsigned char *data;
+ struct partition *ext;
+ Sector sect;
+ int slot;
+ int insane;
+ int sector_size = bdev_hardsect_size(bdev) / 512;
+ int ret = 0; /* default is to report ok */
+
+ /* don't return early; allow all partition entries to be checked */
+ for (slot = 1 ; slot <= 4 ; slot++, p++) {
+ insane = 0; /* track sanity within each table entry */
+
+ if (NR_SECTS(p) == 0)
+ continue; /* ignore zero-sized entries */
+
+ if (START_SECT(p) > bdev->bd_disk->capacity-1) { /* invalid - beyond end of disk */
+ insane |= 1; /* bit-0 flags insane start */
+ }
+ if (START_SECT(p)+NR_SECTS(p)-1 > bdev->bd_disk->capacity-1) { /* invalid - beyond end of disk */
+ insane |= 2; /* bit-1 flags insane end */
+ }
+ if (!insane && is_extended_partition(p)) { /* check the extended partition */
+ data = read_dev_sector(bdev, START_SECT(p)*sector_size, §); /* fetch sector from cache */
+ if (data) {
+ if (msdos_magic_present(data + 510)) { /* check for signature */
+ ext = (struct partition *) (data + 0x1be);
+ ret = check_sane_values(ext, bdev); /* recursive call */
+ if (ret == -1) /* insanity found */
+ insane |= 4; /* bit-2 flags insane extended partition contents */
+ }
+ put_dev_sector(sect); /* release sector to cache */
+ }
+ else ret = -1; /* failed to read sector from cache */
+
+ }
+ if (insane) { /* insanity found; report it */
+ ret = -1; /* error code */
+ printk("\n"); /* start error report on a fresh line */
+ if (insane | 1)
+ printk(" partition %d: start (sector %d) beyond end of disk (sector %d)\n",
+ slot, START_SECT(p), (unsigned int) bdev->bd_disk->capacity-1);
+ if (insane | 2)
+ printk(" partition %d: end (sector %d) beyond end of disk (sector %d)\n",
+ slot, START_SECT(p)+NR_SECTS(p)-1, (unsigned int) bdev->bd_disk->capacity-1);
+ if (insane | 4)
+ printk(" partition %d: insane extended contents\n", slot);
+ }
+ }
+ return ret;
+}
+
int msdos_partition(struct parsed_partitions *state, struct block_device *bdev)
{
@@ -448,6 +521,18 @@ int msdos_partition(struct parsed_partit
#endif
p = (struct partition *) (data + 0x1be);
+ /*
+ * Check that *all* sector offsets are valid before actually building the partition structure
+ * Do it now rather than inside the loop that builds the partition entries to avoid having to
+ * unwind an unknown number of put_partition() calls in this loop and in the (possible) calls
+ * to parse_extended()
+ * Added by TJ <[email protected]>, 31 January 2007.
+ */
+ if (check_sane_values(p, bdev) == -1) {
+ put_dev_sector(sect); /* release to cache */
+ return -1; /* report invalid partition table */
+ }
+
/*
* Look for partitions in two passes:
* First find the primary and DOS-type extended partitions.
On Wed, 31 Jan 2007 23:50:39 +0000
TJ <[email protected]> wrote:
> + if (!insane && is_extended_partition(p)) { /* check the extended partition */
> + data = read_dev_sector(bdev, START_SECT(p)*sector_size, §); /* fetch sector from cache */
> + if (data) {
> + if (msdos_magic_present(data + 510)) { /* check for signature */
> + ext = (struct partition *) (data + 0x1be);
> + ret = check_sane_values(ext, bdev); /* recursive call */
The recursion is a concern. Is there any way in which a cunningly-crafted
device can cause sufficiently deep recursion to crash the kernel?
Also, from reading the replies I think we'd like to see some more
explanation of why this is necessary: are you really really sure that those
disks were incorrectly handling illegal sector numbers? Knowing the IBM
and Maxtor model numbers might be useful.
I assume you were using a driver in drivers/ide/? Perhaps this is really
an IDE-layer bug.
On 2/1/07, Andrew Morton <[email protected]> wrote:
> Also, from reading the replies I think we'd like to see some more
> explanation of why this is necessary: are you really really sure that those
> disks were incorrectly handling illegal sector numbers? Knowing the IBM
> and Maxtor model numbers might be useful.
I'd like to know the model numbers (and firmware revisions) too.
The drive knows how many LBAs it has, and making sure the span of the
command is within the valid LBA space is one of the first things
checked.
Another possibility of for clunking is if the drive is damaged and
cannot track follow in a region, the drive will go to the latch and
push off repeatedly attempting to reacquire servo information. That
can be scary sounding if you haven't heard it before.
Not saying the bug is impossible, but I'd like to know more.
--eric
On Thu, 2007-02-01 at 14:26 -0800, Andrew Morton wrote:
> The recursion is a concern. Is there any way in which a cunningly-crafted
> device can cause sufficiently deep recursion to crash the kernel?
I did wonder about this, but couldn't think of an *elegant* way of
mirroring the extended_partition logic that would cope with any changes
to its implementation in msdos.c in the future, without someone paying
attention and knowing to manually change it in check_sane_values().
I could manually reflect the current policy, which appears to allow MBR
+extended+extended as the current limit?
> Also, from reading the replies I think we'd like to see some more
> explanation of why this is necessary: are you really really sure that those
> disks were incorrectly handling illegal sector numbers?
Yes. Without this patch during boot the kernel log is full of Disk
Errors - thousands of them - which even when I increased the kernel log
size to 1MB sometimes couldn't cope.
I've thought of recording the noise it makes because for anyone familiar with drives
it makes you want to reach for the power-off switch immediately - its not healthy, and
as I reported, does lead to physical issues too.
The errors report the physical and logical sectors, which made it easy
to understand early-on what was happening (seek requests originating in file-system logic),
but it took much longer to identify the *why*.
---- short extract ---
DC202XX: Primary channel reset.
ide2: reset: success
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
end_request: I/O error, dev hde, sector 238276076
printk: 8 messages suppressed.
Buffer I/O error on device hde2, logical block 47279294
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
Error }
-----------
The log-buffer overflow in itself is a big block on identifying the
actions that occur before the errors start, because if you don't have
the time and skills to build a new kernel with increased log size and
debugging there is no clue as to the reason.
I had to stick with it because the array contains partitions and data
from a Windows 2003 Server installation that had to be maintained -
couldn't just 'reformat'.
You can see the original bug report in the Ubuntu bug-tracker where I've
posted the findings and experience during the ongoing hunt for the cause
of the errors, where it shows *small* extracts of the errors being seen.
https://launchpad.net/ubuntu/+source/linux-source-2.6.17/+bug/77734
See particularly the additional comment I made dated
2007-01-25 18:35:54 UTC where I narrowed down the errors to only
affecting 8 unique sector addresses, repeated thousands of times.
Comparing those 8 unique sector addresses with the known partition
layout and file-systems on the disks helped me determine roughly what
was going on, which is when I switched my attention to the detection of
the partitions themselves.
> Knowing the IBM and Maxtor model numbers might be useful.
The Model number is Maxtor 6Y060L0.
I'm afraid I don't have the IBM model numbers because that system was
only used briefly to boot from a Flash-RAM card because it had a
matching FastTrak controller and RAID 1+0 array already defined.
> I assume you were using a driver in drivers/ide/? Perhaps this is really
> an IDE-layer bug.
It is the 'generic' code within drivers/ide/ that issues requests to the drives.
In the early stages I did consider hacking the ide-disk code to just
stop it attempting to request 'illegal' sectors, but that would have
been ugly, could have caused unforeseen consequences, and didn't deal
with the true issue, which is simply a lack of bounds-checking.
If this was a memory buffer bounds-checking would be a priority, but the
partition structure code neglected to do any, which is the root cause of
the issue.
When I started hunting this bug I dropped calls to dump_stack() in the
functions generating the errors, but they turned out to be the interrupt
task handlers (as you'll see in the Ubuntu bug report) which only led to
the work_queue.
The thrashing of the heads comes on some time after (in kernel terms)
the partition is scanned so I was focusing on the file-system, etc., and
thinking the issue was in one of the init scripts.
It took me a long while to track it back to the results of the partition
scan, but as soon as I implemented this patch the problem stopped and
the drives initialise properly.
TJ.
On 2/1/07, TJ <[email protected]> wrote:
> ---- short extract ---
> DC202XX: Primary channel reset.
> ide2: reset: success
> hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
> Error }
> hde: task_in_intr: error=0x04 { DriveStatusError }
> ide: failed opcode was: unknown
> end_request: I/O error, dev hde, sector 238276076
> printk: 8 messages suppressed.
> Buffer I/O error on device hde2, logical block 47279294
> hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
> Error }
> hde: task_in_intr: error=0x04 { DriveStatusError }
> ide: failed opcode was: unknown
> hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest
> Error }
...
> The Model number is Maxtor 6Y060L0.
In the log you posted, the drive posted 51/04 to your out-of-bounds
request, but these are 59/04, do they span the end of the device or
something?
I'd be surprised if the drive clunked because of an out of range
command, that model of drive used a very mature interface firmware
with many generations of testing and qualification.
On Thu, 2007-02-01 at 17:10 -0700, Eric D. Mudama wrote:
> In the log you posted, the drive posted 51/04 to your out-of-bounds
> request, but these are 59/04, do they span the end of the device or
> something?
As you can see in the error-log including in the Ubuntu bug report, the
51/04 errors are generated in dma_intr() and occur in the initial phase
when the driver is using DMA.
After a few of these errors it disables DMA and switches to PIO mode,
which is when the 59/04 errors are generated in task_in_intr().
TJ.