2014-06-19 15:35:56

by Theodore Ts'o

[permalink] [raw]
Subject: BUG: scheduling while atomic in blk_mq codepath?

While trying to bisect some problems which were introduced sometime
between 3.15 and 3.16-rc1 (specifically, (1) reads to a block device
at offset 262144 * 4k are failing with a short read, and (2) block
device reads are sometimes causing the entire kernel to hang), the
following BUG got hit.

[ 0.000000] Linux version 3.15.0-rc8-06047-gaaeb255 (tytso@closure) (gcc version 4.8.3 (Debian 4.8.3-2) ) #1902 SMP Thu Jun 19 11:16:10 EDT 2014

[....] Checking file systems...fsck from util-linux 2.20.1
/dev/vdg was not cleanly unmounted, check forced.
[ 4.161703] BUG: scheduling while atomic: fsck.ext4/2072/0x0000000266.5%
[ 4.163673] no locks held by fsck.ext4/2072.
[ 4.164318] Modules linked in:
[ 4.164845] CPU: 0 PID: 2072 Comm: fsck.ext4 Not tainted 3.15.0-rc8-06047-gaaeb255 #1902
[ 4.166047] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 4.166917] 00000000 00000000 f52c5ba0 c0832655 f5158610 f52c5bac c082f88a f6501e40
[ 4.168188] f52c5c20 c08362ca c0eb3e40 c0eb3e40 374d3933 00000001 0396a8da 00000000
[ 4.169474] f5158610 f51f1674 f4f46a00 f52c5be4 c015dd4b f4f46a00 f52c5bf0 c015dd5e
[ 4.170781] Call Trace:
[ 4.171159] [<c0832655>] dump_stack+0x48/0x60
[ 4.171838] [<c082f88a>] __schedule_bug+0x5c/0x6d
[ 4.172572] [<c08362ca>] __schedule+0x61/0x65a
[ 4.173228] [<c015dd4b>] ? kvm_clock_read+0x1f/0x29
[ 4.173977] [<c015dd5e>] ? kvm_clock_get_cycles+0x9/0xc
[ 4.174771] [<c01b4cb9>] ? timekeeping_get_ns.constprop.14+0x10/0x56
[ 4.175701] [<c0836922>] schedule+0x5f/0x61
[ 4.176345] [<c0836aa2>] io_schedule+0x50/0x67
[ 4.177060] [<c0423b2d>] bt_get+0xaf/0xd1
[ 4.177677] [<c0198282>] ? wake_up_atomic_t+0x1f/0x1f
[ 4.178444] [<c0423bfd>] blk_mq_get_tag+0x26/0x82
[ 4.179158] [<c0420f14>] __blk_mq_alloc_request+0x2a/0x169
[ 4.180022] [<c04222b5>] blk_mq_map_request+0x137/0x1e3
[ 4.180825] [<c0422f89>] blk_sq_make_request+0x82/0x145
[ 4.181630] [<c041a687>] generic_make_request+0x82/0xb5
[ 4.182430] [<c041a7aa>] submit_bio+0xf0/0x109
[ 4.183113] [<c019e97c>] ? trace_hardirqs_on_caller+0x14e/0x169
[ 4.184019] [<c025de72>] _submit_bh+0x1ad/0x1ca
[ 4.184661] [<c025de9e>] submit_bh+0xf/0x11
[ 4.185267] [<c025f5c9>] block_read_full_page+0x1e2/0x1f2
[ 4.186073] [<c025f8cd>] ? I_BDEV+0xa/0xa
[ 4.186695] [<c020ad30>] ? __lru_cache_add+0x24/0x46
[ 4.187452] [<c020af13>] ? lru_cache_add+0xd/0xf
[ 4.188130] [<c025fc04>] blkdev_readpage+0x14/0x16
[ 4.188832] [<c0209adf>] __do_page_cache_readahead+0x1c0/0x1eb
[ 4.189704] [<c0209cb9>] ondemand_readahead+0x1af/0x1b9
[ 4.190508] [<c0209d22>] page_cache_async_readahead+0x5f/0x6a
[ 4.191424] [<c0202370>] generic_file_aio_read+0x226/0x4f4
[ 4.192272] [<c0260841>] blkdev_aio_read+0x90/0x9e
[ 4.193017] [<c02385cd>] do_sync_read+0x52/0x79
[ 4.193731] [<c023857b>] ? fdput_pos+0x25/0x25
[ 4.194412] [<c0238d27>] vfs_read+0x72/0xd1
[ 4.195064] [<c02391da>] SyS_read+0x49/0x7c
[ 4.195700] [<c083a0c9>] syscall_call+0x7/0xb
[ 4.196385] [<c0830000>] ? print_usage_bug+0xcd/0x18e

Is any of these known problems? This is blocking me from doing any
kind of testing at the moment... (these problems are showing up while
running KVM using virtio devices).

- Ted


2014-06-19 15:59:42

by Jens Axboe

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic in blk_mq codepath?

On 2014-06-19 08:35, Theodore Ts'o wrote:
> While trying to bisect some problems which were introduced sometime
> between 3.15 and 3.16-rc1 (specifically, (1) reads to a block device
> at offset 262144 * 4k are failing with a short read, and (2) block
> device reads are sometimes causing the entire kernel to hang), the
> following BUG got hit.
>
> [ 0.000000] Linux version 3.15.0-rc8-06047-gaaeb255 (tytso@closure) (gcc version 4.8.3 (Debian 4.8.3-2) ) #1902 SMP Thu Jun 19 11:16:10 EDT 2014
>
> [....] Checking file systems...fsck from util-linux 2.20.1
> /dev/vdg was not cleanly unmounted, check forced.
> [ 4.161703] BUG: scheduling while atomic: fsck.ext4/2072/0x0000000266.5%
> [ 4.163673] no locks held by fsck.ext4/2072.
> [ 4.164318] Modules linked in:
> [ 4.164845] CPU: 0 PID: 2072 Comm: fsck.ext4 Not tainted 3.15.0-rc8-06047-gaaeb255 #1902
> [ 4.166047] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 4.166917] 00000000 00000000 f52c5ba0 c0832655 f5158610 f52c5bac c082f88a f6501e40
> [ 4.168188] f52c5c20 c08362ca c0eb3e40 c0eb3e40 374d3933 00000001 0396a8da 00000000
> [ 4.169474] f5158610 f51f1674 f4f46a00 f52c5be4 c015dd4b f4f46a00 f52c5bf0 c015dd5e
> [ 4.170781] Call Trace:
> [ 4.171159] [<c0832655>] dump_stack+0x48/0x60
> [ 4.171838] [<c082f88a>] __schedule_bug+0x5c/0x6d
> [ 4.172572] [<c08362ca>] __schedule+0x61/0x65a
> [ 4.173228] [<c015dd4b>] ? kvm_clock_read+0x1f/0x29
> [ 4.173977] [<c015dd5e>] ? kvm_clock_get_cycles+0x9/0xc
> [ 4.174771] [<c01b4cb9>] ? timekeeping_get_ns.constprop.14+0x10/0x56
> [ 4.175701] [<c0836922>] schedule+0x5f/0x61
> [ 4.176345] [<c0836aa2>] io_schedule+0x50/0x67
> [ 4.177060] [<c0423b2d>] bt_get+0xaf/0xd1
> [ 4.177677] [<c0198282>] ? wake_up_atomic_t+0x1f/0x1f
> [ 4.178444] [<c0423bfd>] blk_mq_get_tag+0x26/0x82
> [ 4.179158] [<c0420f14>] __blk_mq_alloc_request+0x2a/0x169
> [ 4.180022] [<c04222b5>] blk_mq_map_request+0x137/0x1e3
> [ 4.180825] [<c0422f89>] blk_sq_make_request+0x82/0x145
> [ 4.181630] [<c041a687>] generic_make_request+0x82/0xb5
> [ 4.182430] [<c041a7aa>] submit_bio+0xf0/0x109
> [ 4.183113] [<c019e97c>] ? trace_hardirqs_on_caller+0x14e/0x169
> [ 4.184019] [<c025de72>] _submit_bh+0x1ad/0x1ca
> [ 4.184661] [<c025de9e>] submit_bh+0xf/0x11
> [ 4.185267] [<c025f5c9>] block_read_full_page+0x1e2/0x1f2
> [ 4.186073] [<c025f8cd>] ? I_BDEV+0xa/0xa
> [ 4.186695] [<c020ad30>] ? __lru_cache_add+0x24/0x46
> [ 4.187452] [<c020af13>] ? lru_cache_add+0xd/0xf
> [ 4.188130] [<c025fc04>] blkdev_readpage+0x14/0x16
> [ 4.188832] [<c0209adf>] __do_page_cache_readahead+0x1c0/0x1eb
> [ 4.189704] [<c0209cb9>] ondemand_readahead+0x1af/0x1b9
> [ 4.190508] [<c0209d22>] page_cache_async_readahead+0x5f/0x6a
> [ 4.191424] [<c0202370>] generic_file_aio_read+0x226/0x4f4
> [ 4.192272] [<c0260841>] blkdev_aio_read+0x90/0x9e
> [ 4.193017] [<c02385cd>] do_sync_read+0x52/0x79
> [ 4.193731] [<c023857b>] ? fdput_pos+0x25/0x25
> [ 4.194412] [<c0238d27>] vfs_read+0x72/0xd1
> [ 4.195064] [<c02391da>] SyS_read+0x49/0x7c
> [ 4.195700] [<c083a0c9>] syscall_call+0x7/0xb
> [ 4.196385] [<c0830000>] ? print_usage_bug+0xcd/0x18e
>
> Is any of these known problems? This is blocking me from doing any
> kind of testing at the moment... (these problems are showing up while
> running KVM using virtio devices).

I believe you already reported this issue a while back, and it should be
fixed by commit cb96a42c in the kernel.

The other issue, not sure, not a lot of detail. It may be fixed by the
pull request I sent out yesterday. You can try pulling in:

git://git.kernel.dk/linux-block.git for-linus

and see if that fixes it for you.

--
Jens Axboe

2014-06-19 16:08:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic in blk_mq codepath?

On Thu, Jun 19, 2014 at 08:59:26AM -0700, Jens Axboe wrote:
> I believe you already reported this issue a while back, and it should be
> fixed by commit cb96a42c in the kernel.

Ah yes, I had forgotten. Thanks for the reminder!

> The other issue, not sure, not a lot of detail. It may be fixed by the pull
> request I sent out yesterday. You can try pulling in:
>
> git://git.kernel.dk/linux-block.git for-linus

Thanks, I'll give that a try.

- Ted

2014-06-19 16:21:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic in blk_mq codepath?

On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote:
> > The other issue, not sure, not a lot of detail. It may be fixed by the pull
> > request I sent out yesterday. You can try pulling in:
> >
> > git://git.kernel.dk/linux-block.git for-linus
>
> Thanks, I'll give that a try.

I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing
the following. On a 32-bit x86 3.15 kernel, run: "mke2fs -t ext3
/dev/vdc" where /dev/vdc is a 5 gig virtio partition.

The boot 3.16-rc1 with linux-block.git's for_linus branch merged in.
(Or 3.16-rc1 by itself):

root@candygram:~# e2fsck -fy /dev/vdc
e2fsck 1.43-WIP (4-Feb-2014)
Pass 1: Checking inodes, blocks, and sizes
[ 22.872217] random: nonblocking pool is initialized
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Error reading block 262144 (Attempt to read block from filesystem resulted in short read) while reading inode and block bitmaps. Ignore error? yes

Force rewrite? yes

Block bitmap differences: +(262144--262657)
Fix? yes


/dev/vdc: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vdc: 11/327680 files (0.0% non-contiguous), 55935/1310720 blocks

With the 3.15 kernel, this works fine.

- Ted

2014-06-19 22:38:26

by Dave Chinner

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic in blk_mq codepath?

On Thu, Jun 19, 2014 at 12:21:44PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 19, 2014 at 12:08:01PM -0400, Theodore Ts'o wrote:
> > > The other issue, not sure, not a lot of detail. It may be fixed by the pull
> > > request I sent out yesterday. You can try pulling in:
> > >
> > > git://git.kernel.dk/linux-block.git for-linus
> >
> > Thanks, I'll give that a try.
>
> I tried merging in your for-linus branch in v3.16-rc1, and I'm seeing
> the following. On a 32-bit x86 3.15 kernel, run: "mke2fs -t ext3
> /dev/vdc" where /dev/vdc is a 5 gig virtio partition.

Short reads are more likely a bug in all the iovec iterator stuff
that got merged in from the vfs tree. ISTR a 32 bit-only bug in that
stuff go past in to do with not being able to partition a 32GB block
dev on a 32 bit system due to a 32 bit size_t overflow somewhere....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-06-21 03:51:54

by Theodore Ts'o

[permalink] [raw]
Subject: 32-bit bug in iovec iterator changes

On Fri, Jun 20, 2014 at 08:38:20AM +1000, Dave Chinner wrote:
>
> Short reads are more likely a bug in all the iovec iterator stuff
> that got merged in from the vfs tree. ISTR a 32 bit-only bug in that
> stuff go past in to do with not being able to partition a 32GB block
> dev on a 32 bit system due to a 32 bit size_t overflow somewhere....

Dave Chinner called it.

Al, I'm seeing a regression which shows up using a 32-bit x86 kernel.
The symptoms of the bug is when run under KVM, with a 5 GB /dev/vdc
virtual block device, a read at offset 2 ** 30 fails with a short
read:

# dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0164144 s, 0.0 kB/s

On a 3.15 kernel, this command works:

# dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0457984 s, 89.4 kB/s

I tried bisecting it, but unfortunately the iovec iterator changes are
not cleanly bisectable, since copy_page_from_iter() gets introduced
some two dozen patches before it gets defined. :-(

However, the bisect leads quite squarely to to the iovec iterator
patches.

Al, I'd appreciate it if you could take a look?

Thanks!!

- Ted


% git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# good: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect good aaeb2554337217dfa4eac2fcc90da7be540b9a73
# bad: [16b9057804c02e2d351e9c8f606e909b43cbd9e7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect bad 16b9057804c02e2d351e9c8f606e909b43cbd9e7
# good: [82abb273d838318424644d8f02825db0fbbd400a] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
git bisect good 82abb273d838318424644d8f02825db0fbbd400a
# good: [d1e1cda862c16252087374ac75949b0e89a5717e] Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
git bisect good d1e1cda862c16252087374ac75949b0e89a5717e
# good: [23d4ed53b7342bf5999b3ea227d9f69e75e5a625] Merge branch 'for-linus' of git://git.kernel.dk/linux-block
git bisect good 23d4ed53b7342bf5999b3ea227d9f69e75e5a625
# good: [2840c566e95599cd60c7143762ca8b49d9395050] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect good 2840c566e95599cd60c7143762ca8b49d9395050
# good: [4251c2a67011801caecd63671f26dd8c9aedb24c] Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
git bisect good 4251c2a67011801caecd63671f26dd8c9aedb24c
# skip: [3dae8750c368f8ac11c3c8c2a28f56dcee865c01] cifs: switch to ->write_iter()
git bisect skip 3dae8750c368f8ac11c3c8c2a28f56dcee865c01
# good: [5c02c392cd2320e8d612376d6b72b6548a680923] Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
git bisect good 5c02c392cd2320e8d612376d6b72b6548a680923
# bad: [5f073850602084fbcbb987948ff3e70ae273f7d2] kill generic_file_splice_write()
git bisect bad 5f073850602084fbcbb987948ff3e70ae273f7d2
# good: [38583f095c5a8138ae2a1c9173d0fd8a9f10e8aa] Merge branch 'akpm' (incoming from Andrew)
git bisect good 38583f095c5a8138ae2a1c9173d0fd8a9f10e8aa
# good: [f5ccfe1ddbaf9d923a3ebdadcb1e5e32d83e9c28] ext4: fix locking for O_APPEND writes
git bisect good f5ccfe1ddbaf9d923a3ebdadcb1e5e32d83e9c28
# bad: [f0d1bec9d58d4c038d0ac958c9af82be6eb18045] new helper: copy_page_from_iter()
git bisect bad f0d1bec9d58d4c038d0ac958c9af82be6eb18045


% (unset DISPLAY; git bisect visualize)
commit f0d1bec9d58d4c038d0ac958c9af82be6eb18045
Author: Al Viro <[email protected]>
Date: Thu Apr 3 15:05:18 2014 -0400

new helper: copy_page_from_iter()

parallel to copy_page_to_iter(). pipe_write() switched to it (and became
->write_iter()).

Signed-off-by: Al Viro <[email protected]>

commit 84c3d55cc474f9c234c023c92e2769f940d5548c
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:33:23 2014 -0400

fuse: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit b30ac0fc4109701fc122d41ee085c65b52dc44a3
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:29:04 2014 -0400

btrfs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit 3ef045c3d8ae8550abbfd44074efce6ff642cc86
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:25:22 2014 -0400

ocfs2: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit bf97f3bc0c32140c43fe5ca53d23514ea46a54ca
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:20:23 2014 -0400

xfs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit 50b5551d1719c8bce60c6d4027b814cfc72c2307
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:13:46 2014 -0400

afs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit da56e45b6ee83b67a586c61819cd2b5cfd806eb8
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:11:01 2014 -0400

gfs2: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit edaf43694898c5d7deb9a394335c60e888039100
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:07:25 2014 -0400

nfs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit f5674c31ee1f968606702e82c160d6ae11032ded
Author: Al Viro <[email protected]>
Date: Thu Apr 3 14:00:23 2014 -0400

ubifs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit a8f3550cd228b6edc5d17fce1a9af8cc7004f185
Author: Al Viro <[email protected]>
Date: Thu Apr 3 03:32:25 2014 -0400

bury __generic_file_aio_write()

all users converted to __generic_file_write_iter() now

Signed-off-by: Al Viro <[email protected]>

commit 3dae8750c368f8ac11c3c8c2a28f56dcee865c01
Author: Al Viro <[email protected]>
Date: Thu Apr 3 12:05:17 2014 -0400

cifs: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit d4637bc18f3bf89bd63fe7b967043523aa3a6170
Author: Al Viro <[email protected]>
Date: Thu Apr 3 03:31:17 2014 -0400

udf: switch to ->write_iter()

Signed-off-by: Al Viro <[email protected]>

commit 9b884164d59707216840159d45f6be68073fac6e
Author: Al Viro <[email protected]>
Date: Thu Apr 17 16:09:22 2014 -0400

convert ext4 to ->write_iter()

unfortunately, Ted's changes to ext4_file_write() are *still* an
incomplete fix - playing with rlimits can let you smuggle an
unaligned request past the checks. So there almost certainly
will be more merge PITA around that place...

[fix from Peter Ujfalusi <[email protected]> folded]

Signed-off-by: Al Viro <[email protected]>

commit a8324754889c0a491b216bc0502ef9ba557eeac7
Merge: 1456c0a f5ccfe1
Author: Al Viro <[email protected]>
Date: Tue May 6 17:38:41 2014 -0400

Merge ext4 changes in ext4_file_write() into for-next

From ext4.git#dev, needed for switch of ext4 to ->write_iter() ;-/

commit 1456c0a87c4241d3a801651019e66983c69ad17d
Author: Al Viro <[email protected]>
Date: Thu Apr 3 03:21:50 2014 -0400

blkdev_aio_write() - turn into blkdev_write_iter()

Signed-off-by: Al Viro <[email protected]>

commit 8174202b34c30e0c07231bf63f18ab29af634f0b
Author: Al Viro <[email protected]>
Date: Thu Apr 3 03:17:43 2014 -0400

write_iter variants of {__,}generic_file_aio_write()

Signed-off-by: Al Viro <[email protected]>

commit 3644424dc6309439c4c8d97590cdac4100376255
Author: Al Viro <[email protected]>
Date: Wed Apr 2 20:28:01 2014 -0400

ceph: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit 3aa2d199f8eb8149a88005e88736d583cbc39d31
Author: Al Viro <[email protected]>
Date: Wed Apr 2 20:14:12 2014 -0400

nfs: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit a886038baa48a17f2cf77fb5dcc204fd28659d8f
Author: Al Viro <[email protected]>
Date: Wed Apr 2 20:02:21 2014 -0400

fs/block_dev.c: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit 2ba5bbed0cd7429dbd567fa885ae3bc7a76de3d4
Author: Al Viro <[email protected]>
Date: Wed Apr 2 20:00:02 2014 -0400

shmem: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit fb9096a344e2964c6a71520931c08abb1301248e
Author: Al Viro <[email protected]>
Date: Wed Apr 2 19:56:54 2014 -0400

pipe: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit e6a7bcb4c489e3e078ba3cc92ae6621b2b8bb9a7
Author: Al Viro <[email protected]>
Date: Wed Apr 2 19:53:36 2014 -0400

cifs: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit 37c20f16e7a73e5fe34815e785ca6c5a46e4d260
Author: Al Viro <[email protected]>
Date: Wed Apr 2 14:47:09 2014 -0400

fuse_file_aio_read(): convert to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit 3cd9ad5a303a0d503492002c4af95becfa99af03
Author: Al Viro <[email protected]>
Date: Wed Apr 2 14:44:18 2014 -0400

ocfs2: switch to ->read_iter()

tracepoints are evil, exhibit #6969...

Signed-off-by: Al Viro <[email protected]>

commit 027978295d455b2fdff0cb36570f83ada7385ea6
Author: Al Viro <[email protected]>
Date: Wed Apr 2 14:40:38 2014 -0400

ecryptfs: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit b4f5d2c6d1f88c79e48f1296076b3a6a22f58c0f
Author: Al Viro <[email protected]>
Date: Wed Apr 2 14:37:59 2014 -0400

xfs: switch to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit aad4f8bb42af06371aa0e85bf0cd9d52c0494985
Author: Al Viro <[email protected]>
Date: Wed Apr 2 14:33:16 2014 -0400

switch simple generic_file_aio_read() users to ->read_iter()

Signed-off-by: Al Viro <[email protected]>

commit 293bc9822fa9b3c9d4b7893bcb241e085580771a
Author: Al Viro <[email protected]>
Date: Tue Feb 11 18:37:41 2014 -0500

new methods: ->read_iter() and ->write_iter()

Beginning to introduce those. Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <[email protected]> folded]

Signed-off-by: Al Viro <[email protected]>

commit 7f7f25e82d54870df24d415a7007fbd327da027b
Author: Al Viro <[email protected]>
Date: Tue Feb 11 17:49:24 2014 -0500

replace checking for ->read/->aio_read presence with check in ->f_mode

Since we are about to introduce new methods (read_iter/write_iter), the
tests in a bunch of places would have to grow inconveniently. Check
once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
and FMODE_CAN_WRITE resp. It might end up being a temporary measure -
once everything switches from ->aio_{read,write} to ->{read,write}_iter
it might make sense to return to open-coded checks. We'll see...

Signed-off-by: Al Viro <[email protected]>

commit b318891929c2750055a4002bee3e7636ca3684de
Author: Al Viro <[email protected]>
Date: Wed Apr 2 07:06:30 2014 -0400

xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()

pos is redundant (it's iocb->ki_pos), and iov/nr_segs/count are taken
care of by lifting iov_iter into the caller.

Signed-off-by: Al Viro <[email protected]>

commit 37938463540b075e9166cf774c59274379f7a8ca
Author: Al Viro <[email protected]>
Date: Sat Mar 22 06:57:37 2014 -0400

blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten()

Signed-off-by: Al Viro <[email protected]>

commit 0c949334a9e2581646c6ff0d1470a805b1e5be99
Author: Al Viro <[email protected]>
Date: Sat Mar 22 06:51:37 2014 -0400

iov_iter_truncate()

Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.

Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.

Signed-off-by: Al Viro <[email protected]>

commit 28060d5d9b261da110afe48aae7a2aa6555f798f
Author: Al Viro <[email protected]>
Date: Sat Mar 22 05:15:17 2014 -0400

btrfs: switch check_direct_IO() to iov_iter

... and don't open-code iov_iter_alignment() there

Signed-off-by: Al Viro <[email protected]>

commit 91f79c43d1b54d7154b118860d81b39bad07dfff
Author: Al Viro <[email protected]>
Date: Fri Mar 21 04:58:33 2014 -0400

new helper: iov_iter_get_pages_alloc()

same as iov_iter_get_pages(), except that pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for caller to
free. Lustre and NFS ->direct_IO() switched to it.

Signed-off-by: Al Viro <[email protected]>

commit f67da30c1d5fc9e341bc8121708874bfd7b31e45
Author: Al Viro <[email protected]>
Date: Wed Mar 19 01:16:16 2014 -0400

new helper: iov_iter_npages()

counts the pages covered by iov_iter, up to given limit.
do_block_direct_io() and fuse_iter_npages() switched to
it.

Signed-off-by: Al Viro <[email protected]>

commit 5b46f25ddc6edf4adff1a137d453da542c27a640
Author: Al Viro <[email protected]>
Date: Sun Mar 16 18:07:34 2014 -0400

f2fs: switch to iov_iter_alignment()

Signed-off-by: Al Viro <[email protected]>

commit c9c37e2e63786c595d704244cbb7d19dc5630493
Author: Al Viro <[email protected]>
Date: Sun Mar 16 16:08:30 2014 -0400

fuse: switch to iov_iter_get_pages()

Signed-off-by: Al Viro <[email protected]>

commit d22a943f44c79c98ac7a93653fdd330378581741
Author: Al Viro <[email protected]>
Date: Sun Mar 16 15:50:47 2014 -0400

fuse: pull iov_iter initializations up

... to fuse_direct_{read,write}(). ->direct_IO() path uses the
iov_iter passed by the caller instead.

Signed-off-by: Al Viro <[email protected]>

commit 7b2c99d15559e285384c742db52316802e24b0bd
Author: Al Viro <[email protected]>
Date: Sat Mar 15 04:05:57 2014 -0400

new helper: iov_iter_get_pages()

iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning
the pages of up to maxsize of (contiguous) data from iter. Returns the
amount of memory grabbed or -error. In case of success, the requested
area begins at offset start in pages[0] and runs through pages[1], etc.
Less than requested amount might be returned - either because the contiguous
area in the beginning of iterator is smaller than requested, or because
the kernel failed to pin that many pages.

direct-io.c switched to using iov_iter_get_pages()

Signed-off-by: Al Viro <[email protected]>

commit 3320c60b3a26d05666285c55ab08ee043c017ba3
Author: Al Viro <[email protected]>
Date: Mon Mar 10 02:30:55 2014 -0400

dio: take updating ->result into do_direct_IO()

Signed-off-by: Al Viro <[email protected]>

commit 71d8e532b1549a478e6a6a8a44f309d050294d00
Author: Al Viro <[email protected]>
Date: Wed Mar 5 19:28:09 2014 -0500

start adding the tag to iov_iter

For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.

Signed-off-by: Al Viro <[email protected]>

commit ed978a811ec528dbe40243605c3afab55892f722
Author: Al Viro <[email protected]>
Date: Wed Mar 5 22:53:04 2014 -0500

new helper: generic_file_read_iter()

iov_iter-using variant of generic_file_aio_read(). Some callers
converted. Note that it's still not quite there for use as ->read_iter() -
we depend on having zero iter->iov_offset in O_DIRECT case. Fortunately,
that's true for all converted callers (and for generic_file_aio_read() itself).

Signed-off-by: Al Viro <[email protected]>

commit 23faa7b8db9be0be4f158cfc558460bb95d9b245
Author: Al Viro <[email protected]>
Date: Wed Mar 5 22:52:34 2014 -0500

fuse_file_aio_write(): merge initializations of iov_iter

Signed-off-by: Al Viro <[email protected]>

commit 05bb2e0bc77cb005248be318d2b0ba369b8bbab3
Author: Al Viro <[email protected]>
Date: Wed Mar 5 19:22:23 2014 -0500

ceph_aio_read(): keep iov_iter across retries

Signed-off-by: Al Viro <[email protected]>

commit 886a39115005ced8b15ab067c9c2a8d546b40a5e
Author: Al Viro <[email protected]>
Date: Wed Mar 5 13:50:45 2014 -0500

new primitive: iov_iter_alignment()

returns the value aligned as badly as the worst remaining segment
in iov_iter is. Use instead of open-coded equivalents.

Signed-off-by: Al Viro <[email protected]>

commit 26978b8b4d83c46f4310b253db70fa9e65149e7c
Author: Al Viro <[email protected]>
Date: Mon Mar 10 14:08:45 2014 -0400

give ->direct_IO() a copy of iov_iter

the thing is, we want to advance what's given to ->direct_IO() as we
are forming the request; however, the callers care about the amount
of data actually transferred, not the amount we tried to transfer.
It's more convenient to allow ->direct_IO() instances do use
iov_iter_advance() on the copy of iov_iter, leaving the actual
advancing of the original to caller.

Signed-off-by: Al Viro <[email protected]>

commit 31b140398ce56ab41646eda7f02bcb78d6a4c916
Author: Al Viro <[email protected]>
Date: Wed Mar 5 01:33:16 2014 -0500

switch {__,}blockdev_direct_IO() to iov_iter

Signed-off-by: Al Viro <[email protected]>

commit a6cbcd4a4a85e2fdb0b3344b88df2e8b3d526b9e
Author: Al Viro <[email protected]>
Date: Tue Mar 4 22:38:00 2014 -0500

get rid of pointless iov_length() in ->direct_IO()

all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <[email protected]>

commit 16b1f05d7f5ab4ce570963aca5f3b2b5d21822fa
Author: Al Viro <[email protected]>
Date: Tue Mar 4 22:14:00 2014 -0500

ext4: switch the guts of ->direct_IO() to iov_iter

Signed-off-by: Al Viro <[email protected]>

commit 619d30b4b8c488042b4a720ca79dccc346d1a516
Author: Al Viro <[email protected]>
Date: Tue Mar 4 21:53:33 2014 -0500

convert the guts of nfs_direct_IO() to iov_iter

Signed-off-by: Al Viro <[email protected]>

commit d8d3d94b80aa1a1c0ca75c58b8abdc7356f38418
Author: Al Viro <[email protected]>
Date: Tue Mar 4 21:27:34 2014 -0500

pass iov_iter to ->direct_IO()

unmodified, for now

Signed-off-by: Al Viro <[email protected]>

commit cb66a7a1f149ff705fa37cad6d1252b046e0ad4f
Author: Al Viro <[email protected]>
Date: Tue Mar 4 15:24:06 2014 -0500

kill generic_segment_checks()

all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().

Signed-off-by: Al Viro <[email protected]>

commit 0ae5e4d370599592eab845527b31708a4f3411be
Author: Al Viro <[email protected]>
Date: Mon Mar 3 22:09:39 2014 -0500

__btrfs_direct_write(): switch to iov_iter

Signed-off-by: Al Viro <[email protected]>

commit f8579f8673b7ecdb7a81d5d5bb1d981093d9aa94
Author: Al Viro <[email protected]>
Date: Mon Mar 3 22:03:20 2014 -0500

generic_file_direct_write(): switch to iov_iter

Signed-off-by: Al Viro <[email protected]>

commit e7c24607b5d68a4cdc56e09d70a3c8bae5f0519f
Author: Al Viro <[email protected]>
Date: Thu Apr 10 20:54:51 2014 -0400

kill iov_iter_copy_from_user()

all callers can use copy_page_from_iter() and it actually simplifies
them.

Signed-off-by: Al Viro <[email protected]>

commit f6c0a1920e0180175bd5e8e4aff8ea5556f1895d
Author: Al Viro <[email protected]>
Date: Wed Apr 23 10:18:46 2014 -0400

fs/file.c: don't open-code kvfree()

Signed-off-by: Al Viro <[email protected]>

2014-06-21 05:53:17

by Al Viro

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Fri, Jun 20, 2014 at 11:51:44PM -0400, Theodore Ts'o wrote:
> On Fri, Jun 20, 2014 at 08:38:20AM +1000, Dave Chinner wrote:
> >
> > Short reads are more likely a bug in all the iovec iterator stuff
> > that got merged in from the vfs tree. ISTR a 32 bit-only bug in that
> > stuff go past in to do with not being able to partition a 32GB block
> > dev on a 32 bit system due to a 32 bit size_t overflow somewhere....
>
> Dave Chinner called it.
>
> Al, I'm seeing a regression which shows up using a 32-bit x86 kernel.
> The symptoms of the bug is when run under KVM, with a 5 GB /dev/vdc
> virtual block device, a read at offset 2 ** 30 fails with a short
> read:
>
> # dd if=/dev/vdc of=/dev/null bs=4k skip=262144 count=1
> 0+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.0164144 s, 0.0 kB/s

Argh...

ed include/linux/uio.h <<EOF
/iov_iter_truncate/s/size_t/u64/
w
q
EOF

Could you check if that fixes the sucker?

2014-06-21 23:09:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote:
>
> ed include/linux/uio.h <<EOF
> /iov_iter_truncate/s/size_t/u64/
> w
> q
> EOF
>
> Could you check if that fixes the sucker?

The following patch (attached at the end) appears to fix the problem,
but looking at uio.h, I'm completely confused about *why* it fixes the
problem. In particular, iov_iter_iovec() makes no sense to me at all,
and I don't understand how the calculation of iov_len makes any sense:

.iov_len = min(iter->count,
iter->iov->iov_len - iter->iov_offset),

It also looks like uio.h is mostly about offsets to memory pointers,
and so why this would make a difference when the issue is the block
device offset goes above 2**30?

There must be deep magic going on here, and so I don't know if your
s/size_t/u64/g substitation also extends to the various functions that
have size_t in them:

unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
size_t iov_iter_single_seg_count(const struct iov_iter *i);
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);


Anyway, this patch does appear to make the problem go away, but given
that I don't understand what is going on here, please take it with a
huge grain of salt. And might I suggest some comments to perhaps give
some context to someone who is trying to understand
include/linux/uio.h?

Thanks!!

- Ted

diff --git a/include/linux/uio.h b/include/linux/uio.h
index e2231e4..bea7b7d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -16,7 +16,7 @@ struct page;

struct kvec {
void *iov_base; /* and that should *never* hold a userland pointer */
- size_t iov_len;
+ u64 iov_len;
};

enum {
@@ -27,8 +27,8 @@ enum {

struct iov_iter {
int type;
- size_t iov_offset;
- size_t count;
+ u64 iov_offset;
+ u64 count;
union {
const struct iovec *iov;
const struct bio_vec *bvec;
@@ -41,12 +41,12 @@ struct iov_iter {
*
* NOTE that it is not safe to use this function until all the iovec's
* segment lengths have been validated. Because the individual lengths can
- * overflow a size_t when added together.
+ * overflow a u64 when added together.
*/
-static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
+static inline u64 iov_length(const struct iovec *iov, unsigned long nr_segs)
{
unsigned long seg;
- size_t ret = 0;
+ u64 ret = 0;

for (seg = 0; seg < nr_segs; seg++)
ret += iov[seg].iov_len;
@@ -89,12 +89,12 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);

-static inline size_t iov_iter_count(struct iov_iter *i)
+static inline u64 iov_iter_count(struct iov_iter *i)
{
return i->count;
}

-static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
+static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
{
if (i->count > count)
i->count = count;
@@ -104,7 +104,7 @@ static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
* reexpand a previously truncated iterator; count must be no more than how much
* we had shrunk it.
*/
-static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
+static inline void iov_iter_reexpand(struct iov_iter *i, u64 count)
{
i->count = count;
}

2014-06-21 23:49:30

by Al Viro

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sat, Jun 21, 2014 at 07:09:22PM -0400, Theodore Ts'o wrote:
> On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote:
> >
> > ed include/linux/uio.h <<EOF
> > /iov_iter_truncate/s/size_t/u64/
> > w
> > q
> > EOF
> >
> > Could you check if that fixes the sucker?
>
> The following patch (attached at the end) appears to fix the problem,
> but looking at uio.h, I'm completely confused about *why* it fixes the
> problem. In particular, iov_iter_iovec() makes no sense to me at all,
> and I don't understand how the calculation of iov_len makes any sense:
>
> .iov_len = min(iter->count,
> iter->iov->iov_len - iter->iov_offset),

Eh? We have iov[0].iov_base..iov[0].iov_base+iov[0].iov_len - 1 for
area covered by the first iovec. First iov_offset bytes have already
been consumed. And at most count bytes matter. So yes, this iov_len
will give you equivalent first iovec.

Said that, iov_iter_iovec() will die shortly - it's a rudiment of older
code, with almost no users left. But yes, it is correct.

> It also looks like uio.h is mostly about offsets to memory pointers,
> and so why this would make a difference when the issue is the block
> device offset goes above 2**30?

It is, and your patch is a huge overkill.

> There must be deep magic going on here, and so I don't know if your
> s/size_t/u64/g substitation also extends to the various functions that
> have size_t in them:

No, it does not. It's specifically about iov_iter_truncate(); moreover,
it matters to only one caller of that sucker. Namely,

static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct inode *bd_inode = file->f_mapping->host;
loff_t size = i_size_read(bd_inode);
loff_t pos = iocb->ki_pos;

if (pos >= size)
return 0;

size -= pos;
iov_iter_truncate(to, size);
return generic_file_read_iter(iocb, to);
}

What happens here is capping to->count, to guarantee that we won't even look
at anything past the end of block device. Alternative fix would be to
have
if (pos >= size)
return 0;
if (to->size + pos > size) {
/* note that size - pos fits into size_t in this case,
* so it's OK to pass it to iov_iter_truncate().
*/
iov_iter_truncate(to, size - pos);
}
return generic_file_read_iter(iocb, to);
in there. Other callers are passing it size_t values already, so we don't
need similar checks there.

Or we can make iov_iter_truncate() take an arbitrary u64 argument, seeing that
it's inlined anyway. IMO it's more robust that way...

Anyway, does the following alone fix the problem you are seeing?

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ddfdb53..dbb02d4 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
return i->count;
}

-static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
+static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
{
if (i->count > count)
i->count = count;

2014-06-22 00:03:23

by James Bottomley

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sun, 2014-06-22 at 00:49 +0100, Al Viro wrote:
> On Sat, Jun 21, 2014 at 07:09:22PM -0400, Theodore Ts'o wrote:
> > On Sat, Jun 21, 2014 at 06:53:07AM +0100, Al Viro wrote:
> > >
> > > ed include/linux/uio.h <<EOF
> > > /iov_iter_truncate/s/size_t/u64/
> > > w
> > > q
> > > EOF
> > >
> > > Could you check if that fixes the sucker?
> >
> > The following patch (attached at the end) appears to fix the problem,
> > but looking at uio.h, I'm completely confused about *why* it fixes the
> > problem. In particular, iov_iter_iovec() makes no sense to me at all,
> > and I don't understand how the calculation of iov_len makes any sense:
> >
> > .iov_len = min(iter->count,
> > iter->iov->iov_len - iter->iov_offset),
>
> Eh? We have iov[0].iov_base..iov[0].iov_base+iov[0].iov_len - 1 for
> area covered by the first iovec. First iov_offset bytes have already
> been consumed. And at most count bytes matter. So yes, this iov_len
> will give you equivalent first iovec.
>
> Said that, iov_iter_iovec() will die shortly - it's a rudiment of older
> code, with almost no users left. But yes, it is correct.
>
> > It also looks like uio.h is mostly about offsets to memory pointers,
> > and so why this would make a difference when the issue is the block
> > device offset goes above 2**30?
>
> It is, and your patch is a huge overkill.
>
> > There must be deep magic going on here, and so I don't know if your
> > s/size_t/u64/g substitation also extends to the various functions that
> > have size_t in them:
>
> No, it does not. It's specifically about iov_iter_truncate(); moreover,
> it matters to only one caller of that sucker. Namely,
>
> static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> struct file *file = iocb->ki_filp;
> struct inode *bd_inode = file->f_mapping->host;
> loff_t size = i_size_read(bd_inode);
> loff_t pos = iocb->ki_pos;
>
> if (pos >= size)
> return 0;
>
> size -= pos;
> iov_iter_truncate(to, size);
> return generic_file_read_iter(iocb, to);
> }
>
> What happens here is capping to->count, to guarantee that we won't even look
> at anything past the end of block device. Alternative fix would be to
> have
> if (pos >= size)
> return 0;
> if (to->size + pos > size) {
> /* note that size - pos fits into size_t in this case,
> * so it's OK to pass it to iov_iter_truncate().
> */
> iov_iter_truncate(to, size - pos);
> }
> return generic_file_read_iter(iocb, to);
> in there. Other callers are passing it size_t values already, so we don't
> need similar checks there.
>
> Or we can make iov_iter_truncate() take an arbitrary u64 argument, seeing that
> it's inlined anyway. IMO it's more robust that way...
>
> Anyway, does the following alone fix the problem you are seeing?
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index ddfdb53..dbb02d4 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
> return i->count;
> }
>
> -static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
> +static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
> {
> if (i->count > count)
> i->count = count;

Al, how can that work? i->count is size_t, which is 32 bit, so we're
going to get truncation errors. I could see this possibly working if
count in struct iov_iter becomes u64 (which is going to have a lot of
knock on consequences, but it seems to me that at least kvec.iov_len and
iov_iter.iov_offset have to become u64 as well.

James

2014-06-22 00:26:27

by Al Viro

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sat, Jun 21, 2014 at 05:03:20PM -0700, James Bottomley wrote:

> > Anyway, does the following alone fix the problem you are seeing?
> >
> > diff --git a/include/linux/uio.h b/include/linux/uio.h
> > index ddfdb53..dbb02d4 100644
> > --- a/include/linux/uio.h
> > +++ b/include/linux/uio.h
> > @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
> > return i->count;
> > }
> >
> > -static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
> > +static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
> > {
> > if (i->count > count)
> > i->count = count;
>
> Al, how can that work? i->count is size_t, which is 32 bit, so we're
> going to get truncation errors.

No, we are not. Look:
* comparison promotes both operands to u64 here, so its result is
accurate, no matter how large count is. They are compared as natural
numbers.
* assignment converts count to size_t, which *would* truncate for
values that are greater than the maximal value representable by size_t.
But in that case it's by definition greater than i->count, so we do not
reach that assignment at all.

2014-06-22 00:32:48

by James Bottomley

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sun, 2014-06-22 at 01:26 +0100, Al Viro wrote:
> On Sat, Jun 21, 2014 at 05:03:20PM -0700, James Bottomley wrote:
>
> > > Anyway, does the following alone fix the problem you are seeing?
> > >
> > > diff --git a/include/linux/uio.h b/include/linux/uio.h
> > > index ddfdb53..dbb02d4 100644
> > > --- a/include/linux/uio.h
> > > +++ b/include/linux/uio.h
> > > @@ -94,7 +94,7 @@ static inline size_t iov_iter_count(struct iov_iter *i)
> > > return i->count;
> > > }
> > >
> > > -static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
> > > +static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
> > > {
> > > if (i->count > count)
> > > i->count = count;
> >
> > Al, how can that work? i->count is size_t, which is 32 bit, so we're
> > going to get truncation errors.
>
> No, we are not. Look:
> * comparison promotes both operands to u64 here, so its result is
> accurate, no matter how large count is. They are compared as natural
> numbers.

True ... figured this out 10 seconds after sending the email.

> * assignment converts count to size_t, which *would* truncate for
> values that are greater than the maximal value representable by size_t.
> But in that case it's by definition greater than i->count, so we do not
> reach that assignment at all.

OK, so what I still don't get is why isn't the compiler warning when we
truncate a u64 to a u32? We should get that warning in your new code,
and we should have got that warning in fs/block_dev.c where it would
have pinpointed the actual problem.

James

2014-06-22 00:54:01

by Al Viro

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote:
> > No, we are not. Look:
> > * comparison promotes both operands to u64 here, so its result is
> > accurate, no matter how large count is. They are compared as natural
> > numbers.
>
> True ... figured this out 10 seconds after sending the email.
>
> > * assignment converts count to size_t, which *would* truncate for
> > values that are greater than the maximal value representable by size_t.
> > But in that case it's by definition greater than i->count, so we do not
> > reach that assignment at all.
>
> OK, so what I still don't get is why isn't the compiler warning when we
> truncate a u64 to a u32? We should get that warning in your new code,
> and we should have got that warning in fs/block_dev.c where it would
> have pinpointed the actual problem.

In which universe?

extern void f(unsigned int);

void g(unsigned long x)
{
f(x);
}

is perfectly valid C, with no warnings in sight. f(1UL << 32) might
give one, but not this...

2014-06-22 01:00:37

by James Bottomley

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sun, 2014-06-22 at 01:53 +0100, Al Viro wrote:
> On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote:
> > > No, we are not. Look:
> > > * comparison promotes both operands to u64 here, so its result is
> > > accurate, no matter how large count is. They are compared as natural
> > > numbers.
> >
> > True ... figured this out 10 seconds after sending the email.
> >
> > > * assignment converts count to size_t, which *would* truncate for
> > > values that are greater than the maximal value representable by size_t.
> > > But in that case it's by definition greater than i->count, so we do not
> > > reach that assignment at all.
> >
> > OK, so what I still don't get is why isn't the compiler warning when we
> > truncate a u64 to a u32? We should get that warning in your new code,
> > and we should have got that warning in fs/block_dev.c where it would
> > have pinpointed the actual problem.
>
> In which universe?
>
> extern void f(unsigned int);
>
> void g(unsigned long x)
> {
> f(x);
> }

In the one where the code is compiled with -Wconversion ... I'm just
surprised, I thought we had this enabled.

James

2014-06-22 01:00:48

by Al Viro

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sun, Jun 22, 2014 at 01:53:52AM +0100, Al Viro wrote:
> On Sat, Jun 21, 2014 at 05:32:44PM -0700, James Bottomley wrote:
> > > No, we are not. Look:
> > > * comparison promotes both operands to u64 here, so its result is
> > > accurate, no matter how large count is. They are compared as natural
> > > numbers.
> >
> > True ... figured this out 10 seconds after sending the email.
> >
> > > * assignment converts count to size_t, which *would* truncate for
> > > values that are greater than the maximal value representable by size_t.
> > > But in that case it's by definition greater than i->count, so we do not
> > > reach that assignment at all.
> >
> > OK, so what I still don't get is why isn't the compiler warning when we
> > truncate a u64 to a u32? We should get that warning in your new code,
> > and we should have got that warning in fs/block_dev.c where it would
> > have pinpointed the actual problem.
>
> In which universe?
>
> extern void f(unsigned int);
>
> void g(unsigned long x)
> {
> f(x);
> }
>
> is perfectly valid C, with no warnings in sight. f(1UL << 32) might
> give one, but not this...

PS: I agree that it's worth careful commenting, obviously, but before sending
it to Linus (*with* comments) I want to get a confirmation that this one-liner
actually fixes what Ted is seeing. I have reproduced it here, and that change
makes the breakage go away in my testing, but I'd like to make sure that we are
seeing the same thing. Ted?

2014-06-22 11:50:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: 32-bit bug in iovec iterator changes

On Sun, Jun 22, 2014 at 02:00:32AM +0100, Al Viro wrote:
>
> PS: I agree that it's worth careful commenting, obviously, but
> before sending it to Linus (*with* comments) I want to get a
> confirmation that this one-liner actually fixes what Ted is seeing.
> I have reproduced it here, and that change makes the breakage go
> away in my testing, but I'd like to make sure that we are seeing the
> same thing. Ted?

Hep, that fixes things. Thanks for explaining what was going on!

- Ted

2014-06-23 07:44:55

by Al Viro

[permalink] [raw]
Subject: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes)

On Sun, Jun 22, 2014 at 07:50:07AM -0400, Theodore Ts'o wrote:
> On Sun, Jun 22, 2014 at 02:00:32AM +0100, Al Viro wrote:
> >
> > PS: I agree that it's worth careful commenting, obviously, but
> > before sending it to Linus (*with* comments) I want to get a
> > confirmation that this one-liner actually fixes what Ted is seeing.
> > I have reproduced it here, and that change makes the breakage go
> > away in my testing, but I'd like to make sure that we are seeing the
> > same thing. Ted?
>
> Hep, that fixes things. Thanks for explaining what was going on!

OK, here it is, hopefully with sufficient comments:

blkdev_read_iter() wants to cap the iov_iter by the amount of
data remaining to the end of device. That's what iov_iter_truncate()
is for (trim iter->count if it's above the given limit). So far,
so good, but the argument of iov_iter_truncate() is size_t, so on
32bit boxen (in case of a large device) we end up with that upper
limit truncated down to 32 bits *before* comparing it with iter->count.

Easily fixed by making iov_iter_truncate() take 64bit argument -
it does the right thing after such change (we only reach the
assignment in there when the current value of iter->count is greater
than the limit, i.e. for anything that would get truncated we don't
reach the assignment at all) and that argument is not the new
value of iter->count - it's an upper limit for such.

The overhead of passing u64 is not an issue - the thing is inlined,
so callers passing size_t won't pay any penalty.

Reported-by: Theodore Tso <[email protected]>
Tested-by: Theodore Tso <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ddfdb53..17ae7e3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -94,8 +94,20 @@ static inline size_t iov_iter_count(struct iov_iter *i)
return i->count;
}

-static inline void iov_iter_truncate(struct iov_iter *i, size_t count)
+/*
+ * Cap the iov_iter by given limit; note that the second argument is
+ * *not* the new size - it's upper limit for such. Passing it a value
+ * greater than the amount of data in iov_iter is fine - it'll just do
+ * nothing in that case.
+ */
+static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
{
+ /*
+ * count doesn't have to fit in size_t - comparison extends both
+ * operands to u64 here and any value that would be truncated by
+ * conversion in assignement is by definition greater than all
+ * values of size_t, including old i->count.
+ */
if (i->count > count)
i->count = count;
}

2014-06-23 15:43:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes)

On Mon, Jun 23, 2014 at 08:44:40AM +0100, Al Viro wrote:
>
> OK, here it is, hopefully with sufficient comments:

The comments look really good. I assume you'll get this to
Linus in time for 3.16-rc3?

Many thanks!!

- Ted

2014-06-24 12:34:55

by Alan Cox

[permalink] [raw]
Subject: Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes)

On Mon, 23 Jun 2014 11:43:02 -0400
"Theodore Ts'o" <[email protected]> wrote:

> On Mon, Jun 23, 2014 at 08:44:40AM +0100, Al Viro wrote:
> >
> > OK, here it is, hopefully with sufficient comments:
>
> The comments look really good. I assume you'll get this to
> Linus in time for 3.16-rc3?

Fixes the 32GB 'can't partition' bug I reported too.

Alan

2014-06-25 16:56:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes)

Al,
just checking - did you expect me to take this from the email, or are
you preparing a pull request?

Linus

On Mon, Jun 23, 2014 at 12:44 AM, Al Viro <[email protected]> wrote:
>
> OK, here it is, hopefully with sufficient comments:

2014-06-26 15:35:19

by Bruno Wolff III

[permalink] [raw]
Subject: Re: [regression] fix 32-bit breakage in block device read(2) (was Re: 32-bit bug in iovec iterator changes)

On Mon, Jun 23, 2014 at 08:44:40 +0100,
Al Viro <[email protected]> wrote:
>
>blkdev_read_iter() wants to cap the iov_iter by the amount of
>data remaining to the end of device. That's what iov_iter_truncate()
>is for (trim iter->count if it's above the given limit). So far,
>so good, but the argument of iov_iter_truncate() is size_t, so on
>32bit boxen (in case of a large device) we end up with that upper
>limit truncated down to 32 bits *before* comparing it with iter->count.

This seems to fix a problem I had
(https://bugzilla.kernel.org/show_bug.cgi?id=78711) with a partition device
(/dev/sda3) being zero size on 3.16 kernels. However I am having some
other issues with 3.16 on i686 and the amount of testing was the raid
array using /dev/sda3 appeared to start (which it hadn't previously), but
the system hung before finishing the boot process.