splice stuff. There are conflicts in lustre; proposed resolution
is in #merge-candidate (same as it is in linux-next). There's a bunch of
branches this cycle, both mine and from other folks and I'd rather send
pull requests separately. This one is the conversion of ->splice_read()
to ITER_PIPE iov_iter (and introduction of such). Gets rid of a lot of
code in fs/splice.c and elsewhere; there will be followups, but these are
for the next cycle... Some pipe/splice-related cleanups from Miklos in
the same branch as well.
The following changes since commit 08895a8b6b06ed2323cd97a36ee40a116b3db8ed:
Linux 4.8-rc8 (2016-09-25 18:47:13 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.splice_read
for you to fetch changes up to a949e63992469fed87aef197347960ced31701b8:
pipe: fix comment in pipe_buf_operations (2016-10-05 18:24:00 -0400)
----------------------------------------------------------------
Al Viro (11):
consistent treatment of EFAULT on O_DIRECT read/write
splice_to_pipe(): don't open-code wakeup_pipe_readers()
splice: switch get_iovec_page_array() to iov_iter
splice: lift pipe_lock out of splice_to_pipe()
new helper: add_to_pipe()
skb_splice_bits(): get rid of callback
fuse_dev_splice_read(): switch to add_to_pipe()
new iov_iter flavour: pipe-backed
switch generic_file_splice_read() to use of ->read_iter()
switch default_file_splice_read() to use of pipe-backed iov_iter
relay: simplify relay_file_read()
Miklos Szeredi (5):
pipe: add pipe_buf_get() helper
pipe: add pipe_buf_release() helper
pipe: add pipe_buf_confirm() helper
pipe: add pipe_buf_steal() helper
pipe: fix comment in pipe_buf_operations
drivers/char/virtio_console.c | 2 +-
drivers/staging/lustre/lustre/llite/file.c | 89 +--
.../staging/lustre/lustre/llite/llite_internal.h | 15 +-
drivers/staging/lustre/lustre/llite/vvp_internal.h | 14 -
drivers/staging/lustre/lustre/llite/vvp_io.c | 45 +-
fs/coda/file.c | 23 +-
fs/direct-io.c | 3 +
fs/fuse/dev.c | 63 +-
fs/gfs2/file.c | 28 +-
fs/nfs/file.c | 25 +-
fs/nfs/internal.h | 2 -
fs/nfs/nfs4file.c | 2 +-
fs/ocfs2/file.c | 34 +-
fs/ocfs2/ocfs2_trace.h | 2 -
fs/pipe.c | 13 +-
fs/splice.c | 683 ++++++---------------
fs/xfs/xfs_file.c | 41 +-
fs/xfs/xfs_trace.h | 1 -
include/linux/fs.h | 2 -
include/linux/pipe_fs_i.h | 59 +-
include/linux/skbuff.h | 8 +-
include/linux/splice.h | 3 +
include/linux/uio.h | 14 +-
kernel/relay.c | 78 +--
lib/iov_iter.c | 395 +++++++++++-
mm/shmem.c | 115 +---
net/core/skbuff.c | 28 +-
net/ipv4/tcp.c | 3 +-
net/kcm/kcmsock.c | 16 +-
net/unix/af_unix.c | 17 +-
30 files changed, 746 insertions(+), 1077 deletions(-)
On Fri, Oct 7, 2016 at 3:20 PM, Al Viro <[email protected]> wrote:
> splice stuff.
Hmm. I've now gotten two oopses today, all at __kmalloc+0xc3/0x1f0,
which seems to be the
*(void **)(object + s->offset);
in get_freepointer(). Because it started happening today, I'm inclined
to blame mainly stuff I merged late yesterday.
I'm pretty sure that 4.8.0-09134-g4c1fad64eff4 is all good, in
particular, while the problems definitely happen with
4.8.0-11288-gb66484cd7470.
Much of the stuff yesterday was non-x86 architectures (the ARM SoC
stuff, avr32, parisc and power), so the main suspects are
- Andrew's series
- Al's splice stuff
- Ted's ext4 changes
- Jens' block layer changes
Yes, there are other things that came in between there, not just the
architecture things, but they seem much less likely to trigger for me.
The traces don't really give me any real ideas, they look like this:
BUG: unable to handle kernel paging request at ffff9db749d0c000
IP: [<ffffffffb320cbe3>] __kmalloc+0xc3/0x1f0
PGD 426098067
PUD 426099067
PMD 344b1a067
PTE 0
Oops: 0000 [#1] SMP
Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter
xt_conntrack ebtable_nat ebtable_broute bridge st
acpi_als pinctrl_sunrisepoint tpm_tis pinctrl_intel kfifo_buf
tpm_tis_core tpm industrialio acpi_pad nfsd auth_rpcgss nfs_acl lockd
grace sunrpc dm_crypt i915 i2c_algo_bit drm_kms_helper
crct10dif_pclmul crc32_pclm
CPU: 0 PID: 3091 Comm: collect2 Tainted: G O
4.8.0-11288-gb66484cd7470-dirty #4
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8ee43dbad940 task.stack: ffff9db749ee4000
RIP: 0010:[<ffffffffb320cbe3>] [<ffffffffb320cbe3>] __kmalloc+0xc3/0x1f0
RSP: 0018:ffff9db749ee7b80 EFLAGS: 00010246
RAX: ffff9db749d0c000 RBX: 00000000024000c0 RCX: 0000000000000000
RDX: 00000000000034f7 RSI: 0000000000000000 RDI: 000000000001b620
RBP: ffff9db749ee7bb0 R08: ffff8ee4b6c1b620 R09: ffff8ee475810b3f
R10: ffff9db749d0c000 R11: ffff8ee488a16240 R12: 00000000024000c0
R13: 0000000000000044 R14: ffff8ee4a60037c0 R15: ffff8ee4a60037c0
FS: 00007f3f8b10f740(0000) GS:ffff8ee4b6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff9db749d0c000 CR3: 00000003a188a000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffffffffb3258f68 ffff9db749ee7c40 0000000000000024 ffff8ee3f5810b40
7974697275636573 ffffffffb3c99760 ffff9db749ee7bd8 ffffffffb3258f68
ffff9db749ee7c40 ffff8ee4a1cb7378 ffff8ee4a1cb7360 ffff9db749ee7c10
Call Trace:
[<ffffffffb3258f68>] ? simple_xattr_alloc+0x28/0x60
[<ffffffffb3258f68>] simple_xattr_alloc+0x28/0x60
[<ffffffffb31bec60>] shmem_initxattrs+0x90/0xd0
[<ffffffffb333e60a>] security_inode_init_security+0x11a/0x160
[<ffffffffb31bebd0>] ? shmem_fh_to_dentry+0x60/0x60
[<ffffffffb31c00e2>] shmem_mknod+0x62/0xd0
[<ffffffffb31c0418>] shmem_create+0x18/0x20
[<ffffffffb324110a>] path_openat+0x128a/0x13c0
[<ffffffffb3242541>] do_filp_open+0x91/0x100
[<ffffffffb325051f>] ? __alloc_fd+0x3f/0x170
[<ffffffffb322fe10>] do_sys_open+0x130/0x220
[<ffffffffb322ff1e>] SyS_open+0x1e/0x20
[<ffffffffb379df20>] entry_SYSCALL_64_fastpath+0x13/0x94
Code: 49 83 78 10 00 4d 8b 10 0f 84 ce 00 00 00 4d 85 d2 0f 84 c5 00
00 00 49 63 47 20 49 8b 3f 4c 01 d0 40 f6 c7 0f 0f 85 1a 01 00 00 <48>
8b 18 48 8d 4a 01 4c 89 d0 65 48 0f c7 0f 0f 94 c0 84 c0 74
RIP [<ffffffffb320cbe3>] __kmalloc+0xc3/0x1f0
and
general protection fault: 0000 [#1] SMP
Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
xt_conntrack ebtable_nat ebtable_broute bridge st
acpi_als pinctrl_sunrisepoint kfifo_buf pinctrl_intel industrialio
tpm_tis tpm_tis_core tpm acpi_pad nfsd auth_rpcgss nfs_acl lockd grace
sunrpc dm_crypt i915 crct10dif_pclmul crc32_pclmul crc32c_intel
i2c_algo_bit
CPU: 5 PID: 3649 Comm: make Not tainted 4.8.0-11290-g13510890a847-dirty #3
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8e3738188000 task.stack: ffffabe649e88000
RIP: 0010:[<ffffffff8720cd63>] [<ffffffff8720cd63>] __kmalloc+0xc3/0x1f0
RSP: 0018:ffffabe649e8bc38 EFLAGS: 00010246
RAX: 1e7acd36f90e784c RBX: 00000000024080c0 RCX: ffff8e36e78631f4
RDX: 000000000000284a RSI: 0000000000000000 RDI: 000000000001b620
RBP: ffffabe649e8bc68 R08: ffff8e3776d5b620 R09: 0000000084200088
R10: 1e7acd36f90e784c R11: 0000000069636574 R12: 00000000024080c0
R13: 000000000000004b R14: ffff8e37660037c0 R15: ffff8e37660037c0
FS: 00007f020d92a740(0000) GS:ffff8e3776d40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffed15ee080 CR3: 00000003f815a000 CR4: 00000000003406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffffffff872bbb8e ffff8e36faebc818 ffff8e3717123100 00000000b0b3acc0
000000000ce85fb7 ffff8e36e78631f4 ffffabe649e8bca8 ffffffff872bbb8e
ffffabe649e8bcf0 ffff8e36faebc818 ffffabe649e8bd80 ffff8e36e78631fc
Call Trace:
[<ffffffff872bbb8e>] ? ext4_htree_store_dirent+0x3e/0x120
[<ffffffff872bbb8e>] ext4_htree_store_dirent+0x3e/0x120
[<ffffffff872cd427>] htree_dirblock_to_tree+0xc7/0x1c0
[<ffffffff872ce572>] ext4_htree_fill_tree+0xb2/0x320
[<ffffffff871e0da1>] ? special_mapping_fault+0x31/0xa0
[<ffffffff872bb900>] ext4_readdir+0x660/0x890
[<ffffffff8734620d>] ? __inode_security_revalidate+0x4d/0x70
[<ffffffff87245f22>] iterate_dir+0x172/0x1a0
[<ffffffff87246398>] SyS_getdents+0x98/0x120
[<ffffffff87246120>] ? fillonedir+0xc0/0xc0
[<ffffffff8779e0a0>] entry_SYSCALL_64_fastpath+0x13/0x94
Code: 49 83 78 10 00 4d 8b 10 0f 84 ce 00 00 00 4d 85 d2 0f 84 c5 00
00 00 49 63 47 20 49 8b 3f 4c 01 d0 40 f6 c7 0f 0f 85 1a 01 00 00 <48>
8b 18 48 8d 4a 01 4c 89 d0 65 48 0f c7 0f 0f 94 c0 84 c0 74
RIP [<ffffffff8720cd63>] __kmalloc+0xc3/0x1f0
RSP <ffffabe649e8bc38>
---[ end trace 843edceadb3bd424 ]---
so in both cases it was filesystem stuff, but I'm not sure how much of
a pattern that is.
The trapping instruction is just a
mov (%rax),%rbx
and as you can see rax is garbage.
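As a side note on why the two traces above report different fault types for the same instruction: a minimal sketch, assuming x86-64 with 4-level paging (48-bit virtual addresses). The helper name is_canonical_48() is made up for illustration. A canonical but unmapped address gives a page fault ("unable to handle kernel paging request"), while a non-canonical one gives a general protection fault.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses. An address is canonical when
 * bits 63..47 are all copies of bit 47, i.e. all zeroes or all ones.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 bits 63..47 */

	return top == 0 || top == 0x1ffff;
}
```

With the RAX values from the traces, 0xffff9db749d0c000 passes this check (hence the page-table walk ending in "PTE 0"), and 0x1e7acd36f90e784c does not (hence the general protection fault).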
I guess I'll need to just run with slab debugging on, but I wanted to
bring this to people's attention in case it rings a bell for somebody.
I haven't been merging anything today, partly because of this.
The problem *may* go back further, but I did run 4c1fad64eff4 for a
while without any sign of this.
Linus
On Sat, Oct 8, 2016 at 11:05 PM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. I've now gotten two oopses today, all at __kmalloc+0xc3/0x1f0,
> which seems to be the
>
> *(void **)(object + s->offset);
>
> in get_freepointer().
Actually, it's in "get_freepointer_safe()", it's just that without
DEBUG_PAGEALLOC the two end up being the same.
> I guess I'll need to just run with slab debugging on, but I wanted to
> bring this to people's attention in case it rings a bell for somebody.
> I haven't been merging anything today, partly because of this.
Hmm. When I enabled SLUB debugging, I also enabled DEBUG_PAGEALLOC,
because "why not". But it turns out that that may have been a mistake,
because it changes the very path that failed to no longer do that
failing access (or rather, it does it as a "probe_kernel_read()",
which traps and ignores the failure).
So all my "careful" testing seems to have been pointless, because I
enabled too much debugging, making sure that the problem cannot
happen. No wonder I couldn't reproduce this.
I'll continue with *just* SLUB debugging on, but I thought it was
interesting how enabling more memory access debugging actually ends up
changing some really subtle code.
The "get_freepointer_safe()" thing is explicitly doing a read that
could be to free'd memory, and it then depends on doing the
this_cpu_cmpxchg_double() to abort the operation if it's no longer
valid.
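The pattern described here (read the next pointer optimistically, then commit only if nothing changed) can be sketched in userspace roughly as follows. This is not the SLUB code: it is a toy under stated assumptions, where free objects are linked by array index, the head index and a transaction id are packed into one 64-bit word, and a single C11 compare-exchange stands in for this_cpu_cmpxchg_double(). All names (toy_cache, toy_alloc_once, steal_head) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NOBJ 4
#define NIL  0xffffffffu

struct toy_cache {
	_Atomic uint64_t head;	/* low 32 bits: first free index, high 32 bits: tid */
	uint32_t next[NOBJ];	/* stand-in for the in-object free pointer */
};

static uint64_t pack(uint32_t idx, uint32_t tid)
{
	return ((uint64_t)tid << 32) | idx;
}

static void toy_init(struct toy_cache *c)
{
	for (uint32_t i = 0; i < NOBJ; i++)
		c->next[i] = (i + 1 < NOBJ) ? i + 1 : NIL;
	atomic_store(&c->head, pack(0, 0));
}

/*
 * One attempt at the fast path. Returns false if the list is empty or if
 * the commit was aborted. 'interfere', if non-NULL, runs between the
 * optimistic read and the commit to simulate a concurrent alloc/free.
 */
static bool toy_alloc_once(struct toy_cache *c,
			   void (*interfere)(struct toy_cache *),
			   uint32_t *out)
{
	uint64_t old = atomic_load(&c->head);
	uint32_t idx = (uint32_t)old;
	uint32_t tid = (uint32_t)(old >> 32);
	uint32_t next;

	if (idx == NIL)
		return false;
	assert(idx < NOBJ);

	/* Optimistic read: the value may be stale by the time we commit. */
	next = c->next[idx];

	if (interfere)
		interfere(c);

	/* Commit only if neither the head nor the tid changed underneath us. */
	if (!atomic_compare_exchange_strong(&c->head, &old, pack(next, tid + 1)))
		return false;
	*out = idx;
	return true;
}

/* Simulated interference: somebody else takes the head object first. */
static void steal_head(struct toy_cache *c)
{
	uint32_t dummy;

	toy_alloc_once(c, NULL, &dummy);
}
```

Calling toy_alloc_once(&c, steal_head, &got) on a fresh cache aborts, and a retry without interference returns index 1. The difference from the kernel case is that here the optimistic read is an array access that cannot fault; in the kernel it is a dereference of whatever the freelist holds, so a corrupted freelist traps before the compare-exchange gets a chance to reject it.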
I'm adding Christoph to the cc, not because the slub code has changed
lately (this optimistic access logic is 5+ years old), but because
maybe Christoph remembers what tends to trigger these kinds of issues.
Christoph, the problem is that something is triggering an oops or page
fault (depending on how bogus the address is) in __kmalloc() when it
does that get_freepointer_safe() thing without DEBUG_PAGEALLOC. I've
seen two different cases on two different boots, but they both were on
that one instruction that did that
void *next_object = get_freepointer_safe(s, object);
access. Both were to random kmalloc'ed memory (it *may* be a very
specific size that sees the corruption, but it's hard to tell, the
callchains were different and in both cases depended on some dynamic
length thing - once the directory entry name, in another case the
xattr name length).
The subject line is about Al's splice pull, but that's only one of the
ones I suspect are the potential causes. It could easily be Andrew's
pile (maybe that nice fsnotify locking cleanup causes double free's?),
Ted's ext4 changes (didn't look whether that could have allocation
pattern changes with bugs) or Jens' block layer changes.
Could be elsewhere too. I saw it twice in one day which would *tend*
to mean that it's recent, but maybe I was just lucky the previous days
and didn't hit it. I haven't been able to repro it now, but maybe I
figured out one reason why my reproductions have been failing ;)
Linus
On Sun, Oct 9, 2016 at 11:40 AM, Linus Torvalds
<[email protected]> wrote:
>
> I'll continue with *just* SLUB debugging on, but I thought it was
> interesting how enabling more memory access debugging actually ends up
> changing some really subtle code.
Indeed, now with DEBUG_PAGEALLOC disabled, I got a crash again. It
apparently happened earlier in the call chain (so maybe the slub
debugging found something), which should be good. Except it now
happens in a context where I just see a hung machine, and nothing
makes it onto the screen or into the logs ;(
So this thing is apparently not all that hard to trigger, but it
doesn't exactly seem easy either. It tends to happen fairly soon after
a reboot, which makes me suspect it's some cold-cache issue, but that
doesn't narrow things down as much as you'd think. It could still be
the block layer changes, but it could also equally well be the ext4
changes.
I don't think there's any splice activity anywhere, but who knows. And
the splice changes could have buggered the pipe locking, so..
Anyway, I don't think I can bisect it, but I'll try to narrow it down
a *bit* at least.
Not doing any more pulls on this unstable base, I've been puttering
around in trying to clean up some stupid printk logging issues
instead.
Linus
On Sun, 9 Oct 2016, Linus Torvalds wrote:
> Hmm. When I enabled SLUB debugging, I also enabled DEBUG_PAGEALLOC,
> because "why not". But it turns out that that may have been a mistake,
> because it changes the very path that failed to no longer do that
> failing access (or rather, it does it as a "probe_kernel_read()",
> which traps and ignores the failure).
DEBUG_PAGEALLOC significantly changes the layout of objects and thus this
may no longer trigger.
> I'll continue with *just* SLUB debugging on, but I thought it was
> interesting how enabling more memory access debugging actually ends up
> changing some really subtle code.
Debugging options to memory allocation functions can change the memory
layout which may cause the corruption to no longer happen or no longer
happen the same way. I surely wish there were another way.
> Christoph, the problem is that something is triggering an oops or page
> fault (depending on how bogus the address is) in __kmalloc() when it
> does that get_freepointer_safe() thing without DEBUG_PAGEALLOC. I've
> seen two different cases on two different boots, but they both were on
> that one instruction that did that
Hmm.. Then get_freepointer_safe may not be ok. Should not trigger any
faults.
> Could be elsewhere too. I saw it twice in one day which would *tend*
> to mean that it's recent, but maybe I was just lucky the previous days
> and didn't hit it. I haven't been able to repro it now, but maybe I
> figured out one reason why my reproductions have been failing ;)
OK, reading the rest of the thread, it seems that we found the issue, but
this get_freepointer_safe failure is still not good. Do you have some more
debugging output that can shed some more light on the failure of
get_freepointer_safe?
On Mon, Oct 10, 2016 at 7:03 AM, Christoph Lameter <[email protected]> wrote:
>
> Hmm.. Then get_freepointer_safe may not be ok. Should not trigger any
> faults.
So the reason seems to be that SLUB doesn't actually react well to
double-freeing bugs.
I'm not sure how to fix that. I think the optimistic load that SLUB
does is actually important, since it is what allows the whole
lock-free double_cmpxchg() approach.
But the fact that it reacts _so_ badly to double-freeing issues when
the freelist has become corrupted due to an object being free'd and
then modified is clearly very fragile and not great.
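To make the failure mode concrete, here is a toy userspace freelist that keeps the free pointer inside each free object, the way the default (non-debug) layout does. It is only a sketch: the names (toy_slab, toy_alloc, toy_free) and the sizes are invented for illustration, and there is no locking or per-cpu state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 32
#define NOBJ     4

struct toy_slab {
	void *freelist;		/* first free object, or NULL */
	size_t offset;		/* where the free pointer sits inside an object */
	unsigned char mem[NOBJ * OBJ_SIZE];
};

/* Read the free pointer stored inside a (supposedly) free object. */
static void *get_fp(struct toy_slab *s, void *obj)
{
	void *p;

	memcpy(&p, (unsigned char *)obj + s->offset, sizeof(p));
	return p;
}

static void set_fp(struct toy_slab *s, void *obj, void *fp)
{
	assert(s->offset + sizeof(fp) <= OBJ_SIZE);
	memcpy((unsigned char *)obj + s->offset, &fp, sizeof(fp));
}

/* No checking at all: a second free of the same object is accepted. */
static void toy_free(struct toy_slab *s, void *obj)
{
	set_fp(s, obj, s->freelist);
	s->freelist = obj;
}

static void *toy_alloc(struct toy_slab *s)
{
	void *obj = s->freelist;

	if (obj)
		s->freelist = get_fp(s, obj);	/* trusts the object's contents */
	return obj;
}

static void toy_slab_init(struct toy_slab *s)
{
	s->freelist = NULL;
	s->offset = 0;
	for (int i = NOBJ - 1; i >= 0; i--)
		toy_free(s, s->mem + i * OBJ_SIZE);
}
```

Freeing the same object twice makes its free pointer point at itself, so the next two allocations return the same object. Writing to an object after freeing it overwrites the free pointer, so after the next allocation the freelist head is whatever the stale writer stored there, and the allocation after that dereferences it.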
Doing a google search for "kmalloc", "oops" and "cmpxchg16b" does show
that it happens: you can tell by how the trapping instruction is a
load just before the cmpxchg16b instruction in the oops disassembly.
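A rough way to spot that signature mechanically, given the "Code:" line of an oops: a byte-pattern heuristic, not a disassembler. It assumes the trapping instruction is a REX.W-prefixed "mov r/m64 -> r64" (opcode 8b with a memory operand) and looks for a REX.W-prefixed 0f c7 /1 (cmpxchg16b) starting within a small window after it. The function names and the window parameter are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse the hex bytes of a "Code:" line into buf. The trapping byte is
 * written as <xx> in the oops; its index is stored in *trap (-1 if none).
 * Returns the number of bytes parsed.
 */
static size_t parse_code(const char *s, unsigned char *buf, size_t max, int *trap)
{
	size_t n = 0;

	assert(s && buf && trap);
	*trap = -1;
	while (n < max) {
		bool marked = false;
		int hi, lo;

		while (*s == ' ' || *s == '\n')
			s++;
		if (*s == '<') {
			marked = true;
			s++;
		}
		hi = hexval(s[0]);
		if (hi < 0)
			break;
		lo = hexval(s[1]);
		if (lo < 0)
			break;
		if (marked)
			*trap = (int)n;
		buf[n++] = (unsigned char)(hi << 4 | lo);
		s += 2;
		if (*s == '>')
			s++;
	}
	return n;
}

/* Heuristic: 64-bit load at the trap, cmpxchg16b shortly after it. */
static bool load_before_cmpxchg16b(const unsigned char *code, size_t n,
				   int trap, size_t window)
{
	size_t t, i;

	if (trap < 0 || (size_t)trap + 2 >= n)
		return false;
	t = (size_t)trap;

	/* REX.W prefix (0x48..0x4f), opcode 8b, ModRM mod != 3 (memory). */
	if ((code[t] & 0xf8) != 0x48 || code[t + 1] != 0x8b ||
	    (code[t + 2] >> 6) == 3)
		return false;

	for (i = t + 1; i + 3 < n && i <= t + window; i++) {
		bool rex_w = (code[i] & 0xf8) == 0x48;
		bool opcode = code[i + 1] == 0x0f && code[i + 2] == 0xc7;
		bool reg1 = ((code[i + 3] >> 3) & 7) == 1;

		if (rex_w && opcode && reg1)
			return true;
	}
	return false;
}
```

Fed the Code: bytes from either trace earlier in this thread, the marked byte starts "48 8b 18" (mov (%rax),%rbx) and "48 0f c7 0f" (cmpxchg16b (%rdi), after a 65 segment prefix) follows eleven bytes later, so a window of 16 is enough.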
Maybe we should just make "get_freepointer()" always handle traps.
Right now it does that "probe_kernel_read()" conditionally, and it's a
fairly costly operation, but we *could* make it cheaper. It's really
just a single instruction with an exception entry (kind of like
load_unaligned_zeropad() that we wrote for the dcache case).
I dunno.
Linus
On Mon, 10 Oct 2016, Linus Torvalds wrote:
> But the fact that it reacts _so_ badly to double-freeing issues when
> the freelist has become corrupted due to an object being free'd and
> then modified is clearly very fragile and not great.
Yup, that is why the debug options move the freepointer after the object
and verify that the pointers in the chain point to valid objects in the
slab page. slub_debug has special logic to detect double freeing and that
option can be enabled separately.
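The kind of chain validation meant here can be sketched as below. This is an illustration of the idea rather than the slub_debug code: the struct and function names are hypothetical, and the page is described only by a base address, a per-object stride and an object count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one slab page. */
struct toy_page {
	uintptr_t base;		/* address of the first object */
	size_t size;		/* per-object stride, metadata included */
	unsigned int objects;	/* number of objects in the page */
};

/*
 * A free pointer is plausible only if it is NULL (end of chain) or lands
 * exactly on an object boundary inside this page.
 */
static bool valid_freepointer(const struct toy_page *pg, uintptr_t fp)
{
	assert(pg->size != 0);

	if (fp == 0)
		return true;
	if (fp < pg->base || fp >= pg->base + pg->size * pg->objects)
		return false;
	return (fp - pg->base) % pg->size == 0;
}
```

A check like this rejects a garbage value before it is ever dereferenced, which is exactly what the unchecked fast path gives up for speed.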