Hi,
Kernel both 2.4.16-pre7 and 2.5.1-pre8 have serious problem
when we use direct IO for the block device.
These kernel breaks my root filesystem which has /dev
(block device entry) when an application like dd write data
into /dev/sdXX (= not root filesystem's device).
My environment is:
/dev/hda root filesystem with /dev/ dir.
/dev/sda target direct io device
I lost my root filesystem with:
fd = open("/dev/sda", O_DIRECT);
write(fd, data, large_size);
The reason is that when kernel accesses /dev/sda with O_DIRECT,
blkdev_direct_IO() is called. But, it calls generic_direct_IO(),
and generic_direct_IO() calls brw_kiovec(..., inode->i_dev, ...).
If inode->i_dev is root filesystem, not target direct io device,
then inode->i_dev is used for root filesystem overwriting (= then
I lost my root file system...).
Here is patch, it was based some parts in 2.4.10 (and some 2.4.1x-aa).
In my test with this patch, it works well and
kernel does not break my root partition.
Linus and Marcelo, please apply this patch.
-- gotom
--- fs/block_dev.c.vanilla Sat Dec 8 22:46:15 2001
+++ fs/block_dev.c Mon Dec 10 10:45:20 2001
@@ -115,7 +115,25 @@
static int blkdev_direct_IO(int rw, struct inode * inode, struct kiobuf * iobuf, unsigned long blocknr, int blocksize)
{
- return generic_direct_IO(rw, inode, iobuf, blocknr, blocksize, blkdev_get_block);
+ int i, nr_blocks, retval;
+ unsigned long * blocks = iobuf->blocks;
+
+ nr_blocks = iobuf->length / blocksize;
+ /* build the blocklist */
+ for (i = 0; i < nr_blocks; i++, blocknr++) {
+ struct buffer_head bh;
+
+ retval = blkdev_get_block(inode, blocknr, &bh, 0);
+ if (retval)
+ goto out;
+
+ blocks[i] = bh.b_blocknr;
+ }
+
+ retval = brw_kiovec(rw, 1, &iobuf, inode->i_rdev, iobuf->blocks, blocksize);
+
+ out:
+ return retval;
}
static int blkdev_writepage(struct page * page)
On Mon, 10 Dec 2001, GOTO Masanori wrote:
>
> The reason is that when kernel accesses /dev/sda with O_DIRECT,
> blkdev_direct_IO() is called. But, it calls generic_direct_IO(),
> and generic_direct_IO() calls brw_kiovec(..., inode->i_dev, ...).
That's a bad bug, yes.
However, the bug is really in "generic_direct_IO()", and you should fix it
there, instead of avoiding to use it altogether.
"generic_direct_IO()" should just get the device from "bh->b_dev (which is
filled in correctly by "get_block()".
Linus
At Sun, 9 Dec 2001 21:55:57 -0800 (PST),
Linus Torvalds <[email protected]> wrote:
> On Mon, 10 Dec 2001, GOTO Masanori wrote:
> >
> > The reason is that when kernel accesses /dev/sda with O_DIRECT,
> > blkdev_direct_IO() is called. But, it calls generic_direct_IO(),
> > and generic_direct_IO() calls brw_kiovec(..., inode->i_dev, ...).
>
> That's a bad bug, yes.
Yes, dangerous bug...
> However, the bug is really in "generic_direct_IO()", and you should fix it
> there, instead of avoiding to use it altogether.
>
> "generic_direct_IO()" should just get the device from "bh->b_dev (which is
> filled in correctly by "get_block()".
Oh, that's right, the patch becomes more simple (and works well).
Is this patch OK?
--- linux.vanilla/fs/buffer.c Mon Dec 10 21:13:10 2001
+++ linux/fs/buffer.c Mon Dec 10 20:59:34 2001
@@ -2003,12 +2003,12 @@
{
int i, nr_blocks, retval;
unsigned long * blocks = iobuf->blocks;
+ struct buffer_head bh;
+ bh.b_dev = inode->i_dev;
nr_blocks = iobuf->length / blocksize;
/* build the blocklist */
for (i = 0; i < nr_blocks; i++, blocknr++) {
- struct buffer_head bh;
-
bh.b_state = 0;
bh.b_dev = inode->i_dev;
bh.b_size = blocksize;
@@ -2034,7 +2034,7 @@
blocks[i] = bh.b_blocknr;
}
- retval = brw_kiovec(rw, 1, &iobuf, inode->i_dev, iobuf->blocks, blocksize);
+ retval = brw_kiovec(rw, 1, &iobuf, bh.b_dev, iobuf->blocks, blocksize);
out:
return retval;
-- gotom
On Mon, Dec 10, 2001 at 09:39:51PM +0900, GOTO Masanori wrote:
> At Sun, 9 Dec 2001 21:55:57 -0800 (PST),
> Linus Torvalds <[email protected]> wrote:
> > On Mon, 10 Dec 2001, GOTO Masanori wrote:
> > >
> > > The reason is that when kernel accesses /dev/sda with O_DIRECT,
> > > blkdev_direct_IO() is called. But, it calls generic_direct_IO(),
> > > and generic_direct_IO() calls brw_kiovec(..., inode->i_dev, ...).
> >
> > That's a bad bug, yes.
>
> Yes, dangerous bug...
at least it's not security related :)
>
> > However, the bug is really in "generic_direct_IO()", and you should fix it
> > there, instead of avoiding to use it altogether.
> >
> > "generic_direct_IO()" should just get the device from "bh->b_dev (which is
> > filled in correctly by "get_block()".
>
> Oh, that's right, the patch becomes more simple (and works well).
> Is this patch OK?
yes it is.
Thanks and sorry for your troubles. BTW, if you read back my first post
about the blkdev-pagecache you'll notice I got bitten by the same bug
too and I had to reinstall the test-machine from scratch :). 2.4.10
didn't had such bug, but after integrating the direct-io with the blkdev
inode physical address space the bug seen the light again. I should have
fixed the below way since the first place, at that time there were two
functions anyways and so it was less obvious the below was the right fix
(previously I just did a one liner to fix it (s/dev/rdev/)).
> --- linux.vanilla/fs/buffer.c Mon Dec 10 21:13:10 2001
> +++ linux/fs/buffer.c Mon Dec 10 20:59:34 2001
> @@ -2003,12 +2003,12 @@
> {
> int i, nr_blocks, retval;
> unsigned long * blocks = iobuf->blocks;
> + struct buffer_head bh;
>
> + bh.b_dev = inode->i_dev;
> nr_blocks = iobuf->length / blocksize;
> /* build the blocklist */
> for (i = 0; i < nr_blocks; i++, blocknr++) {
> - struct buffer_head bh;
> -
> bh.b_state = 0;
> bh.b_dev = inode->i_dev;
> bh.b_size = blocksize;
> @@ -2034,7 +2034,7 @@
> blocks[i] = bh.b_blocknr;
> }
>
> - retval = brw_kiovec(rw, 1, &iobuf, inode->i_dev, iobuf->blocks, blocksize);
> + retval = brw_kiovec(rw, 1, &iobuf, bh.b_dev, iobuf->blocks, blocksize);
>
> out:
> return retval;
>
>
> -- gotom
Andrea
On Mon, 10 Dec 2001, GOTO Masanori wrote:
> >
> > "generic_direct_IO()" should just get the device from "bh->b_dev (which is
> > filled in correctly by "get_block()".
>
> Oh, that's right, the patch becomes more simple (and works well).
> Is this patch OK?
This looks fine to me. At some point we _may_ end up with filesystems that
understand about multiple devices, so it's possible in theory that
"get_block()" might return different devices for different blocks, but
that does not happen currently, and I'm not really sure it ever will.
Marcelo, I do believe this should go into 2.4.x too..
Linus
At Mon, 10 Dec 2001 09:10:24 -0800 (PST),
Linus Torvalds <[email protected]> wrote:
> On Mon, 10 Dec 2001, GOTO Masanori wrote:
> > >
> > > "generic_direct_IO()" should just get the device from "bh->b_dev (which is
> > > filled in correctly by "get_block()".
> >
> > Oh, that's right, the patch becomes more simple (and works well).
> > Is this patch OK?
Linus, Andrea, thank you for your comments!
> This looks fine to me. At some point we _may_ end up with filesystems that
> understand about multiple devices, so it's possible in theory that
> "get_block()" might return different devices for different blocks, but
> that does not happen currently, and I'm not really sure it ever will.
I don't test for the multiple device with direct IO under current
kernel..., I keep it in mind.
> Marcelo, I do believe this should go into 2.4.x too..
Thanks Marcelo for including my patch :)
I, however, found another problem.
Accessing with inode size unit (== 4096 byte) is ok, but if I accessed
with block size unit, generic_direct_IO() returns error. The reason
is that blocksize is designated as inode->i_blkbits, and its value is
not disk minimal block size (512), but inode's unit size (4096).
Patch is here. If inode is S_ISBLK and is not mounted, then blocksize
is set as hardsect_size. This implementation needs some computational
cost, but I think it can be ignored. This patch works well for me.
--- linux-2.4.17-pre8.vanilla/mm/filemap.c Tue Dec 11 15:54:57 2001
+++ linux-2.4.17-pre8/mm/filemap.c Tue Dec 11 16:21:40 2001
@@ -1508,8 +1508,22 @@
new_iobuf = 1;
}
- blocksize = 1 << inode->i_blkbits;
- blocksize_bits = inode->i_blkbits;
+ /*
+ * If inode is BLK and it is not already mounted,
+ * blocksize change hardsect_size.
+ */
+ if (S_ISBLK(inode->i_mode) && !is_mounted(inode->i_rdev)) {
+ int size;
+
+ size = get_hardsect_size(inode->i_rdev);
+ blocksize = size;
+ for (blocksize_bits = 0; !(size & 1); )
+ size >>= 1, blocksize_bits++;
+ } else {
+ blocksize = 1 << inode->i_blkbits;
+ blocksize_bits = inode->i_blkbits;
+ }
+
blocksize_mask = blocksize - 1;
chunk_size = KIO_MAX_ATOMIC_IO << 10;
The current problem is that is_mounted() checking is worked for
/dev/sda1 (partition area) only, not for /dev/sda (total disk area),
if /dev/sda1 is mounted. This problem is also existed in /dev/raw
interface, but I think it's not important issue (should I fix it?)
Thanks,
-- gotom
On Tue, 11 Dec 2001, GOTO Masanori wrote:
> + /*
> + * If inode is BLK and it is not already mounted,
> + * blocksize change hardsect_size.
> + */
> + if (S_ISBLK(inode->i_mode) && !is_mounted(inode->i_rdev)) {
That's obviously wrong. What's to stop it from getting mounted
right after that check?
At Tue, 11 Dec 2001 04:13:06 -0500 (EST),
Alexander Viro <[email protected]> wrote:
> > + /*
> > + * If inode is BLK and it is not already mounted,
> > + * blocksize change hardsect_size.
> > + */
> > + if (S_ISBLK(inode->i_mode) && !is_mounted(inode->i_rdev)) {
>
> That's obviously wrong. What's to stop it from getting mounted
> right after that check?
Ah! You're right. I was confused, I'm sorry.
New patch is here (and works well)... Ugh.
--- linux-2.4.17-pre8.vanilla/mm/filemap.c Tue Dec 11 15:54:57 2001
+++ linux-2.4.17-pre8/mm/filemap.c Tue Dec 11 19:36:09 2001
@@ -1508,8 +1508,21 @@
new_iobuf = 1;
}
- blocksize = 1 << inode->i_blkbits;
- blocksize_bits = inode->i_blkbits;
+ /*
+ * If inode is block device, blocksize is set as hardsect_size.
+ */
+ if (S_ISBLK(inode->i_mode)) {
+ int size;
+
+ size = get_hardsect_size(inode->i_rdev);
+ blocksize = size;
+ for (blocksize_bits = 0; !(size & 1); )
+ size >>= 1, blocksize_bits++;
+ } else {
+ blocksize = 1 << inode->i_blkbits;
+ blocksize_bits = inode->i_blkbits;
+ }
+
blocksize_mask = blocksize - 1;
chunk_size = KIO_MAX_ATOMIC_IO << 10;
-- gotom
On Tue, 11 Dec 2001, GOTO Masanori wrote:
> I, however, found another problem.
> Accessing with inode size unit (== 4096 byte) is ok, but if I accessed
> with block size unit, generic_direct_IO() returns error. The reason
> is that blocksize is designated as inode->i_blkbits, and its value is
> not disk minimal block size (512), but inode's unit size (4096).
That is not a bug, but a feature.
We _always_ have to do the IO in "inode size" chunks, and if you want to
change it, you have to change it at a higher level (ie you should set the
blocksize with the "BLKBSZSET" ioctl.)
Linus
On Tue, 11 Dec 2001, GOTO Masanori wrote:
>
> Accessing with inode size unit (== 4096 byte) is ok, but if I accessed
> with block size unit, generic_direct_IO() returns error. The reason
> is that blocksize is designated as inode->i_blkbits, and its value is
> not disk minimal block size (512), but inode's unit size (4096).
Actually, I now found the _real_ bug that explains both the "inode->i_dev"
_and_ the block size problem.
We have the wrong inode.
We use the on-disk inode, which is NOT the same as the "mapping" inode for
actually doing the IO.
This simple (and completely untested) patch should fix it.
This two-liner should also actually make the previous patch completely
unnecessary, because now "inode->i_dev" should be automatically correct. I
should have realized that inode->i_dev should always be right, and have
thought more about the fact that it wasn't.
Linus
-----
--- v2.5.0/linux/mm/filemap.c Wed Nov 21 14:07:25 2001
+++ linux/mm/filemap.c Tue Dec 11 09:34:17 2001
@@ -1486,8 +1485,8 @@
ssize_t retval;
int new_iobuf, chunk_size, blocksize_mask, blocksize, blocksize_bits, iosize, progress;
struct kiobuf * iobuf;
- struct inode * inode = filp->f_dentry->d_inode;
- struct address_space * mapping = inode->i_mapping;
+ struct address_space * mapping = filp->f_dentry->d_inode->i_mapping;
+ struct inode * inode = mapping->host;
new_iobuf = 0;
iobuf = filp->f_iobuf;
Linus Torvalds wrote:
>
> On Tue, 11 Dec 2001, GOTO Masanori wrote:
> >
> > Accessing with inode size unit (== 4096 byte) is ok, but if I accessed
> > with block size unit, generic_direct_IO() returns error. The reason
> > is that blocksize is designated as inode->i_blkbits, and its value is
> > not disk minimal block size (512), but inode's unit size (4096).
>
> Actually, I now found the _real_ bug that explains both the "inode->i_dev"
> _and_ the block size problem.
>
> We have the wrong inode.
>
> We use the on-disk inode, which is NOT the same as the "mapping" inode for
> actually doing the IO.
>
> This simple (and completely untested) patch should fix it.
>
> This two-liner should also actually make the previous patch completely
> unnecessary, because now "inode->i_dev" should be automatically correct. I
> should have realized that inode->i_dev should always be right, and have
> thought more about the fact that it wasn't.
Hah! This is explainig well, why my private experiment
to make the ext3_inode_info ext3_i field in struct inode
a pointer is still failing maliciously some how during the
booting of the system (actually the init proces goes
really after some [ OK ] optics. And this despite the
fact that there are currently only two VFS entry points
where a fresh new filesystem specific inode can be allocated.
Hello,
At Tue, 11 Dec 2001 09:37:37 -0800 (PST),
Linus Torvalds wrote:
>
>
> On Tue, 11 Dec 2001, GOTO Masanori wrote:
> >
> > Accessing with inode size unit (== 4096 byte) is ok, but if I accessed
> > with block size unit, generic_direct_IO() returns error. The reason
> > is that blocksize is designated as inode->i_blkbits, and its value is
> > not disk minimal block size (512), but inode's unit size (4096).
>
> Actually, I now found the _real_ bug that explains both the "inode->i_dev"
> _and_ the block size problem.
>
> We have the wrong inode.
>
> We use the on-disk inode, which is NOT the same as the "mapping" inode for
> actually doing the IO.
>
> This simple (and completely untested) patch should fix it.
>
> This two-liner should also actually make the previous patch completely
> unnecessary, because now "inode->i_dev" should be automatically correct. I
> should have realized that inode->i_dev should always be right, and have
> thought more about the fact that it wasn't.
Your patch will break Goto's fix because mapping->host does not have
correct i_mode. The patch below is for 2.4.17, I think it works on 2.5.1.
diff -Nur linux-2.4.17-pre7.org/fs/block_dev.c linux-2.4.17-pre7/fs/block_dev.c
--- linux-2.4.17-pre7.org/fs/block_dev.c Mon Dec 10 17:34:52 2001
+++ linux-2.4.17-pre7/fs/block_dev.c Mon Dec 10 17:48:21 2001
@@ -329,6 +329,7 @@
inode->i_bdev = new_bdev;
inode->i_data.a_ops = &def_blk_aops;
inode->i_data.gfp_mask = GFP_USER;
+ inode->i_mode = S_IFBLK;
spin_lock(&bdev_lock);
bdev = bdfind(dev, head);
if (!bdev) {
On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
>
> Your patch will break Goto's fix because mapping->host does not have
> correct i_mode.
Note that Goto's fix really is quite _fundamentally_ broken. Don't use it.
As explained in another email, you really should use the set-block-size
ioctl to set the correct blocksize for the device you are using. That
should work fine once the correct inode is used by the direct_io code..
Linus
At Tue, 11 Dec 2001 16:34:15 -0800 (PST),
Linus Torvalds wrote:
>
>
> On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
> >
> > Your patch will break Goto's fix because mapping->host does not have
> > correct i_mode.
>
> Note that Goto's fix really is quite _fundamentally_ broken. Don't use it.
>
OK.
But my patch fixes another bug. Current /dev/ram* does not return -ENOSPC
at the end of device size because generic_file_write() also checks whether
mapping->host is a block device. So I think the patch is required.
On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
>
> But my patch fixes another bug. Current /dev/ram* does not return -ENOSPC
> at the end of device size because generic_file_write() also checks whether
> mapping->host is a block device. So I think the patch is required.
I'll agree with your one-liner: it's good practice anyway to initialize
any fields that could ever be looked at. I actually already applied it to
my tree, I just want to make sure that people don't apply the other
patch..
Linus
At Tue, 11 Dec 2001 17:46:01 -0800 (PST),
Linus Torvalds wrote:
>
>
> On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
> >
> > But my patch fixes another bug. Current /dev/ram* does not return -ENOSPC
> > at the end of device size because generic_file_write() also checks whether
> > mapping->host is a block device. So I think the patch is required.
>
> I'll agree with your one-liner: it's good practice anyway to initialize
> any fields that could ever be looked at. I actually already applied it to
> my tree, I just want to make sure that people don't apply the other
> patch..
>
> Linus
Thank you.
I think the patch should be applied to 2.4 because "dd if=/dev/zero of=/dev/ram1"
can cause system hang easily.
Marcelo, please consider applying the patch.
At Wed, 12 Dec 2001 11:46:29 +0900,
Tachino Nobuhiro <[email protected]> wrote:
> At Tue, 11 Dec 2001 17:46:01 -0800 (PST),
> Linus Torvalds wrote:
> >
> >
> > On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
> > >
> > > But my patch fixes another bug. Current /dev/ram* does not return -ENOSPC
> > > at the end of device size because generic_file_write() also checks whether
> > > mapping->host is a block device. So I think the patch is required.
> >
> > I'll agree with your one-liner: it's good practice anyway to initialize
> > any fields that could ever be looked at. I actually already applied it to
> > my tree, I just want to make sure that people don't apply the other
> > patch..
> >
> > Linus
>
> Thank you.
>
> I think the patch should be applied to 2.4 because "dd if=/dev/zero of=/dev/ram1"
> can cause system hang easily.
Linus, Tachino, your patch are both right.
I was not aware mapping inode used block file inode... Umm.
In my test, it works fine for both block device and file, and resolves
"inode->i_dev" and block size problems.
Thank you for your (simple and complete) patch! I'm now happy :)
> Marcelo, please consider applying the patch.
Yes. Would you remove my previous patches and apply these patches ?
Thanks,
-- gotom
On Wed, 12 Dec 2001, GOTO Masanori wrote:
> At Wed, 12 Dec 2001 11:46:29 +0900,
> Tachino Nobuhiro <[email protected]> wrote:
> > At Tue, 11 Dec 2001 17:46:01 -0800 (PST),
> > Linus Torvalds wrote:
> > >
> > >
> > > On Wed, 12 Dec 2001, Tachino Nobuhiro wrote:
> > > >
> > > > But my patch fixes another bug. Current /dev/ram* does not return -ENOSPC
> > > > at the end of device size because generic_file_write() also checks whether
> > > > mapping->host is a block device. So I think the patch is required.
> > >
> > > I'll agree with your one-liner: it's good practice anyway to initialize
> > > any fields that could ever be looked at. I actually already applied it to
> > > my tree, I just want to make sure that people don't apply the other
> > > patch..
> > >
> > > Linus
> >
> > Thank you.
> >
> > I think the patch should be applied to 2.4 because "dd if=/dev/zero of=/dev/ram1"
> > can cause system hang easily.
>
> Linus, Tachino, your patch are both right.
> I was not aware mapping inode used block file inode... Umm.
>
> In my test, it works fine for both block device and file, and resolves
> "inode->i_dev" and block size problems.
> Thank you for your (simple and complete) patch! I'm now happy :)
>
> > Marcelo, please consider applying the patch.
>
> Yes. Would you remove my previous patches and apply these patches ?
Yes, I will do that for -rc1.