Hi Tj, all,
I can get kernel panic on my machine with kernel 3.9.0-rc7-4-gbb33db7 every time
after booting for several minutes.
Here attached the panic picture and the config.
I'm sorry that the panic log is not completed.
But fortunately, the panic was gone when I reverted the below commit from Tj,
commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f
Author: Tejun Heo <[email protected]>
Date: Fri Jan 11 13:06:33 2013 -0800
block: add missing block_bio_complete() tracepoint
I think this will be helpful for you to resolve this bug, and this may be urgent,
because it's already rc-7 now.
Thanks,
Wanlong Gao
On Wed, 2013-04-17 at 16:36 +0800, Wanlong Gao wrote:
> Hi Tj, all,
>
> I can get kernel panic on my machine with kernel 3.9.0-rc7-4-gbb33db7 every time
> after booting for several minutes.
>
> Here attached the panic picture and the config.
> I'm sorry that the panic log is not completed.
>
> But fortunately, the panic was gone when I reverted the below commit from Tj,
>
> commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f
> Author: Tejun Heo <[email protected]>
> Date: Fri Jan 11 13:06:33 2013 -0800
>
> block: add missing block_bio_complete() tracepoint
>
> I think this will be helpful for you to resolve this bug, and this may be urgent,
> because it's already rc-7 now.
Doing a objdump on the code I found this:
ffffffff8123d479: 8b 77 48 mov 0x48(%rdi),%esi
ffffffff8123d47c: 48 8d 7d c8 lea -0x38(%rbp),%rdi
ffffffff8123d480: 48 8b 4d b8 mov -0x48(%rbp),%rcx
ffffffff8123d484: ba 28 00 00 00 mov $0x28,%edx
ffffffff8123d489: 45 89 e8 mov %r13d,%r8d
ffffffff8123d48c: e8 1f d3 ea ff callq ffffffff810ea7b0 <trace_current_buffer_lock_reserve>
ffffffff8123d491: 48 85 c0 test %rax,%rax
ffffffff8123d494: 49 89 c6 mov %rax,%r14
ffffffff8123d497: 74 5a je ffffffff8123d4f3 <ftrace_raw_event_block_bio_complete+0xb3>
ffffffff8123d499: 48 89 c7 mov %rax,%rdi
ffffffff8123d49c: e8 5f 4f ea ff callq ffffffff810e2400 <ring_buffer_event_data>
ffffffff8123d4a1: 48 8b 4b 10 mov 0x10(%rbx),%rcx
ffffffff8123d4a5: 31 d2 xor %edx,%edx
ffffffff8123d4a7: 48 85 c9 test %rcx,%rcx
ffffffff8123d4aa: 74 02 je ffffffff8123d4ae <ftrace_raw_event_block_bio_complete+0x6e>
ffffffff8123d4ac: 8b 11 mov (%rcx),%edx <<<---- BUG IS HERE
ffffffff8123d4ae: 89 50 08 mov %edx,0x8(%rax)
ffffffff8123d4b1: 48 8b 13 mov (%rbx),%rdx
ffffffff8123d4b4: 48 8d 78 20 lea 0x20(%rax),%rdi
ffffffff8123d4b8: 48 89 50 10 mov %rdx,0x10(%rax)
ffffffff8123d4bc: 8b 53 30 mov 0x30(%rbx),%edx
ffffffff8123d4bf: 44 89 78 1c mov %r15d,0x1c(%rax)
ffffffff8123d4c3: c1 ea 09 shr $0x9,%edx
ffffffff8123d4c6: 89 50 18 mov %edx,0x18(%rax)
ffffffff8123d4c9: 8b 53 30 mov 0x30(%rbx),%edx
ffffffff8123d4cc: 48 8b 73 20 mov 0x20(%rbx),%rsi
According to the picture, that showed where in the code dump, it has the
command that died: "<8b> 11 89 50 08".
Just before that, rcx was tested for zero, and passed (wasn't zero). But
it seems that it wasn't a real address space either.
bio is allocated with kmalloc and not kzalloc, so it is possible that if
bio->bi_bdev never got initialized and is just garbage.
-- Steve
>
>
> Thanks,
> Wanlong Gao
On Wed, Apr 17 2013, Steven Rostedt wrote:
> On Wed, 2013-04-17 at 16:36 +0800, Wanlong Gao wrote:
> > Hi Tj, all,
> >
> > I can get kernel panic on my machine with kernel 3.9.0-rc7-4-gbb33db7 every time
> > after booting for several minutes.
> >
> > Here attached the panic picture and the config.
> > I'm sorry that the panic log is not completed.
> >
> > But fortunately, the panic was gone when I reverted the below commit from Tj,
> >
> > commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f
> > Author: Tejun Heo <[email protected]>
> > Date: Fri Jan 11 13:06:33 2013 -0800
> >
> > block: add missing block_bio_complete() tracepoint
> >
> > I think this will be helpful for you to resolve this bug, and this may be urgent,
> > because it's already rc-7 now.
>
>
> Doing a objdump on the code I found this:
>
> ffffffff8123d479: 8b 77 48 mov 0x48(%rdi),%esi
> ffffffff8123d47c: 48 8d 7d c8 lea -0x38(%rbp),%rdi
> ffffffff8123d480: 48 8b 4d b8 mov -0x48(%rbp),%rcx
> ffffffff8123d484: ba 28 00 00 00 mov $0x28,%edx
> ffffffff8123d489: 45 89 e8 mov %r13d,%r8d
> ffffffff8123d48c: e8 1f d3 ea ff callq ffffffff810ea7b0 <trace_current_buffer_lock_reserve>
> ffffffff8123d491: 48 85 c0 test %rax,%rax
> ffffffff8123d494: 49 89 c6 mov %rax,%r14
> ffffffff8123d497: 74 5a je ffffffff8123d4f3 <ftrace_raw_event_block_bio_complete+0xb3>
> ffffffff8123d499: 48 89 c7 mov %rax,%rdi
> ffffffff8123d49c: e8 5f 4f ea ff callq ffffffff810e2400 <ring_buffer_event_data>
> ffffffff8123d4a1: 48 8b 4b 10 mov 0x10(%rbx),%rcx
> ffffffff8123d4a5: 31 d2 xor %edx,%edx
> ffffffff8123d4a7: 48 85 c9 test %rcx,%rcx
> ffffffff8123d4aa: 74 02 je ffffffff8123d4ae <ftrace_raw_event_block_bio_complete+0x6e>
>
> ffffffff8123d4ac: 8b 11 mov (%rcx),%edx <<<---- BUG IS HERE
>
> ffffffff8123d4ae: 89 50 08 mov %edx,0x8(%rax)
> ffffffff8123d4b1: 48 8b 13 mov (%rbx),%rdx
> ffffffff8123d4b4: 48 8d 78 20 lea 0x20(%rax),%rdi
> ffffffff8123d4b8: 48 89 50 10 mov %rdx,0x10(%rax)
> ffffffff8123d4bc: 8b 53 30 mov 0x30(%rbx),%edx
> ffffffff8123d4bf: 44 89 78 1c mov %r15d,0x1c(%rax)
> ffffffff8123d4c3: c1 ea 09 shr $0x9,%edx
> ffffffff8123d4c6: 89 50 18 mov %edx,0x18(%rax)
> ffffffff8123d4c9: 8b 53 30 mov 0x30(%rbx),%edx
> ffffffff8123d4cc: 48 8b 73 20 mov 0x20(%rbx),%rsi
>
> According to the picture, that showed where in the code dump, it has the
> command that died: "<8b> 11 89 50 08".
>
> Just before that, rcx was tested for zero, and passed (wasn't zero). But
> it seems that it wasn't a real address space either.
>
> bio is allocated with kmalloc and not kzalloc, so it is possible that if
> bio->bi_bdev never got initialized and is just garbage.
A bio is always fully initialized, regardless of which internal
allocator it came from. If people are doing private kmallocs, then they
better be using bio_init() as well.
Wanlong, would it be possible to get a full dmesg on boot see I can see
what drivers and file systems are in use? Anything special about your
setup.
Will get this reverted for 3.9 final, so we have some time to
investigate.
--
Jens Axboe
On 04/18/2013 08:37 PM, Jens Axboe wrote:
> On Wed, Apr 17 2013, Steven Rostedt wrote:
>> On Wed, 2013-04-17 at 16:36 +0800, Wanlong Gao wrote:
>>> Hi Tj, all,
>>>
>>> I can get kernel panic on my machine with kernel 3.9.0-rc7-4-gbb33db7 every time
>>> after booting for several minutes.
>>>
>>> Here attached the panic picture and the config.
>>> I'm sorry that the panic log is not completed.
>>>
>>> But fortunately, the panic was gone when I reverted the below commit from Tj,
>>>
>>> commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f
>>> Author: Tejun Heo <[email protected]>
>>> Date: Fri Jan 11 13:06:33 2013 -0800
>>>
>>> block: add missing block_bio_complete() tracepoint
>>>
>>> I think this will be helpful for you to resolve this bug, and this may be urgent,
>>> because it's already rc-7 now.
>>
>>
>> Doing a objdump on the code I found this:
>>
>> ffffffff8123d479: 8b 77 48 mov 0x48(%rdi),%esi
>> ffffffff8123d47c: 48 8d 7d c8 lea -0x38(%rbp),%rdi
>> ffffffff8123d480: 48 8b 4d b8 mov -0x48(%rbp),%rcx
>> ffffffff8123d484: ba 28 00 00 00 mov $0x28,%edx
>> ffffffff8123d489: 45 89 e8 mov %r13d,%r8d
>> ffffffff8123d48c: e8 1f d3 ea ff callq ffffffff810ea7b0 <trace_current_buffer_lock_reserve>
>> ffffffff8123d491: 48 85 c0 test %rax,%rax
>> ffffffff8123d494: 49 89 c6 mov %rax,%r14
>> ffffffff8123d497: 74 5a je ffffffff8123d4f3 <ftrace_raw_event_block_bio_complete+0xb3>
>> ffffffff8123d499: 48 89 c7 mov %rax,%rdi
>> ffffffff8123d49c: e8 5f 4f ea ff callq ffffffff810e2400 <ring_buffer_event_data>
>> ffffffff8123d4a1: 48 8b 4b 10 mov 0x10(%rbx),%rcx
>> ffffffff8123d4a5: 31 d2 xor %edx,%edx
>> ffffffff8123d4a7: 48 85 c9 test %rcx,%rcx
>> ffffffff8123d4aa: 74 02 je ffffffff8123d4ae <ftrace_raw_event_block_bio_complete+0x6e>
>>
>> ffffffff8123d4ac: 8b 11 mov (%rcx),%edx <<<---- BUG IS HERE
>>
>> ffffffff8123d4ae: 89 50 08 mov %edx,0x8(%rax)
>> ffffffff8123d4b1: 48 8b 13 mov (%rbx),%rdx
>> ffffffff8123d4b4: 48 8d 78 20 lea 0x20(%rax),%rdi
>> ffffffff8123d4b8: 48 89 50 10 mov %rdx,0x10(%rax)
>> ffffffff8123d4bc: 8b 53 30 mov 0x30(%rbx),%edx
>> ffffffff8123d4bf: 44 89 78 1c mov %r15d,0x1c(%rax)
>> ffffffff8123d4c3: c1 ea 09 shr $0x9,%edx
>> ffffffff8123d4c6: 89 50 18 mov %edx,0x18(%rax)
>> ffffffff8123d4c9: 8b 53 30 mov 0x30(%rbx),%edx
>> ffffffff8123d4cc: 48 8b 73 20 mov 0x20(%rbx),%rsi
>>
>> According to the picture, that showed where in the code dump, it has the
>> command that died: "<8b> 11 89 50 08".
>>
>> Just before that, rcx was tested for zero, and passed (wasn't zero). But
>> it seems that it wasn't a real address space either.
>>
>> bio is allocated with kmalloc and not kzalloc, so it is possible that if
>> bio->bi_bdev never got initialized and is just garbage.
>
> A bio is always fully initialized, regardless of which internal
> allocator it came from. If people are doing private kmallocs, then they
> better be using bio_init() as well.
>
> Wanlong, would it be possible to get a full dmesg on boot see I can see
> what drivers and file systems are in use? Anything special about your
> setup.
Sure, attached.
Thanks,
Wanlong Gao
>
> Will get this reverted for 3.9 final, so we have some time to
> investigate.
>
On Thu, Apr 18 2013, Wanlong Gao wrote:
> > A bio is always fully initialized, regardless of which internal
> > allocator it came from. If people are doing private kmallocs, then they
> > better be using bio_init() as well.
> >
> > Wanlong, would it be possible to get a full dmesg on boot see I can see
> > what drivers and file systems are in use? Anything special about your
> > setup.
>
> Sure, attached.
Does the below help?
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 9e5b8c2..f9a51a6 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -780,15 +780,23 @@ static void blk_add_trace_bio_bounce(void *ignore,
blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
}
+static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
+{
+ if (bdev->bd_disk == NULL)
+ return NULL;
+
+ return bdev_get_queue(bdev);
+}
+
static void blk_add_trace_bio_complete(void *ignore, struct bio *bio, int error)
{
struct request_queue *q;
struct blk_trace *bt;
- if (!bio->bi_bdev)
+ q = blk_trace_get_queue(bio->bi_bdev);
+ if (!q)
return;
- q = bdev_get_queue(bio->bi_bdev);
bt = q->blk_trace;
/*
@@ -1641,14 +1649,6 @@ static ssize_t blk_trace_mask2str(char *buf, int mask)
return p - buf;
}
-static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
-{
- if (bdev->bd_disk == NULL)
- return NULL;
-
- return bdev_get_queue(bdev);
-}
-
static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
struct device_attribute *attr,
char *buf)
--
Jens Axboe
On 04/18/2013 09:35 PM, Jens Axboe wrote:
> On Thu, Apr 18 2013, Wanlong Gao wrote:
>>> A bio is always fully initialized, regardless of which internal
>>> allocator it came from. If people are doing private kmallocs, then they
>>> better be using bio_init() as well.
>>>
>>> Wanlong, would it be possible to get a full dmesg on boot see I can see
>>> what drivers and file systems are in use? Anything special about your
>>> setup.
>>
>> Sure, attached.
>
> Does the below help?
Still got panic with this patch. :-(
Thanks,
Wanlong Gao
>
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 9e5b8c2..f9a51a6 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -780,15 +780,23 @@ static void blk_add_trace_bio_bounce(void *ignore,
> blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
> }
>
> +static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
> +{
> + if (bdev->bd_disk == NULL)
> + return NULL;
> +
> + return bdev_get_queue(bdev);
> +}
> +
> static void blk_add_trace_bio_complete(void *ignore, struct bio *bio, int error)
> {
> struct request_queue *q;
> struct blk_trace *bt;
>
> - if (!bio->bi_bdev)
> + q = blk_trace_get_queue(bio->bi_bdev);
> + if (!q)
> return;
>
> - q = bdev_get_queue(bio->bi_bdev);
> bt = q->blk_trace;
>
> /*
> @@ -1641,14 +1649,6 @@ static ssize_t blk_trace_mask2str(char *buf, int mask)
> return p - buf;
> }
>
> -static struct request_queue *blk_trace_get_queue(struct block_device *bdev)
> -{
> - if (bdev->bd_disk == NULL)
> - return NULL;
> -
> - return bdev_get_queue(bdev);
> -}
> -
> static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> struct device_attribute *attr,
> char *buf)
>
On Thu, Apr 18 2013, Wanlong Gao wrote:
> On 04/18/2013 09:35 PM, Jens Axboe wrote:
> > On Thu, Apr 18 2013, Wanlong Gao wrote:
> >>> A bio is always fully initialized, regardless of which internal
> >>> allocator it came from. If people are doing private kmallocs, then they
> >>> better be using bio_init() as well.
> >>>
> >>> Wanlong, would it be possible to get a full dmesg on boot see I can see
> >>> what drivers and file systems are in use? Anything special about your
> >>> setup.
> >>
> >> Sure, attached.
> >
> > Does the below help?
>
> Still got panic with this patch. :-(
Could you capture the full dump? I'd like to see what rcx was, and that
seems to have scrolled off.
--
Jens Axboe
On 04/18/2013 10:30 PM, Jens Axboe wrote:
> On Thu, Apr 18 2013, Wanlong Gao wrote:
>> On 04/18/2013 09:35 PM, Jens Axboe wrote:
>>> On Thu, Apr 18 2013, Wanlong Gao wrote:
>>>>> A bio is always fully initialized, regardless of which internal
>>>>> allocator it came from. If people are doing private kmallocs, then they
>>>>> better be using bio_init() as well.
>>>>>
>>>>> Wanlong, would it be possible to get a full dmesg on boot see I can see
>>>>> what drivers and file systems are in use? Anything special about your
>>>>> setup.
>>>>
>>>> Sure, attached.
>>>
>>> Does the below help?
>>
>> Still got panic with this patch. :-(
>
> Could you capture the full dump? I'd like to see what rcx was, and that
> seems to have scrolled off.
OK, but I should capture it tomorrow morning because this remote machine has already panicked
and need hard reboot.
Thanks,
Wanlong Gao
>
On Thu, Apr 18, 2013 at 7:30 AM, Jens Axboe <[email protected]> wrote:
>
> Could you capture the full dump? I'd like to see what rcx was, and that
> seems to have scrolled off.
You don't need it. The %cr2 value shows the address of the bad access,
and it's 0x00000001.
Linus
Hello,
On Thu, Apr 18, 2013 at 09:08:34AM -0700, Linus Torvalds wrote:
> On Thu, Apr 18, 2013 at 7:30 AM, Jens Axboe <[email protected]> wrote:
> >
> > Could you capture the full dump? I'd like to see what rcx was, and that
> > seems to have scrolled off.
>
> You don't need it. The %cr2 value shows the address of the bad access,
> and it's 0x00000001.
If I'm not mistaken, it seems like we have a bio which has 1 as its
->bi_bdev. Ummm.... no idea whatsoever how that can happen tho right
now. Looking into it.
Thanks.
--
tejun
On Thu, Apr 18 2013, Tejun Heo wrote:
> Hello,
>
> On Thu, Apr 18, 2013 at 09:08:34AM -0700, Linus Torvalds wrote:
> > On Thu, Apr 18, 2013 at 7:30 AM, Jens Axboe <[email protected]> wrote:
> > >
> > > Could you capture the full dump? I'd like to see what rcx was, and that
> > > seems to have scrolled off.
> >
> > You don't need it. The %cr2 value shows the address of the bad access,
> > and it's 0x00000001.
>
> If I'm not mistaken, it seems like we have a bio which has 1 as its
> ->bi_bdev. Ummm.... no idea whatsoever how that can happen tho right
> now. Looking into it.
Yep, thanks Linus for that hint... Must be someone abusing it for a
flag field post submission? Crazy.
--
Jens Axboe
Hello, Wanlong.
On Thu, Apr 18, 2013 at 10:45:10PM +0800, Wanlong Gao wrote:
> OK, but I should capture it tomorrow morning because this remote machine has already panicked
> and need hard reboot.
Can you please apply the following patch when you try the next time
and report the kernel log? It should prevent the oops from happening
and dump information on at least who are the involved parties.
Thanks.
diff --git a/block/blk-core.c b/block/blk-core.c
index 074b758..d5efebe 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -150,6 +150,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
}
EXPORT_SYMBOL(blk_rq_init);
+
+
static void req_bio_endio(struct request *rq, struct bio *bio,
unsigned int nbytes, int error)
{
@@ -174,8 +176,15 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
bio_integrity_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
- if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+ if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ)) {
+ if (bio->bi_bdev && (unsigned long)bio->bi_bdev < 4096) {
+ printk_ratelimited("XXX req_bio_endio: queue %s bio=%p bi_bdev=%p bi_end_io=%pf\n",
+ (rq->q && rq->q->backing_dev_info.dev) ?
+ dev_name(rq->q->backing_dev_info.dev) : "<unknown>",
+ bio, bio->bi_bdev, bio->bi_end_io);
+ }
bio_endio(bio, error);
+ }
}
void blk_dump_rq_flags(struct request *rq, char *msg)
diff --git a/fs/bio.c b/fs/bio.c
index bb5768f..da64e5a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1428,7 +1428,10 @@ void bio_endio(struct bio *bio, int error)
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
error = -EIO;
- trace_block_bio_complete(bio, error);
+ if (bio->bi_bdev && (unsigned long)bio->bi_bdev < 4096)
+ printk_ratelimited("XXX bio_endio: skipping TP for %p\n", bio);
+ else
+ trace_block_bio_complete(bio, error);
if (bio->bi_end_io)
bio->bi_end_io(bio, error);
On Thu, Apr 18, 2013 at 10:39:00AM -0700, Jens Axboe wrote:
> > If I'm not mistaken, it seems like we have a bio which has 1 as its
> > ->bi_bdev. Ummm.... no idea whatsoever how that can happen tho right
> > now. Looking into it.
>
> Yep, thanks Linus for that hint... Must be someone abusing it for a
> flag field post submission? Crazy.
Let's hope that's not the case because there'll be blood if it is. :)
--
tejun
On Thu, Apr 18 2013, Tejun Heo wrote:
> On Thu, Apr 18, 2013 at 10:39:00AM -0700, Jens Axboe wrote:
> > > If I'm not mistaken, it seems like we have a bio which has 1 as its
> > > ->bi_bdev. Ummm.... no idea whatsoever how that can happen tho right
> > > now. Looking into it.
> >
> > Yep, thanks Linus for that hint... Must be someone abusing it for a
> > flag field post submission? Crazy.
>
> Let's hope that's not the case because there'll be blood if it is. :)
Yeah, it's beyond the amount of crazy I've come to expect from various
random users of IO interfaces :-)
--
Jens Axboe
On Thu, Apr 18, 2013 at 11:13 AM, Jens Axboe <[email protected]> wrote:
> On Thu, Apr 18 2013, Tejun Heo wrote:
>> On Thu, Apr 18, 2013 at 10:39:00AM -0700, Jens Axboe wrote:
>> >
>> > Yep, thanks Linus for that hint... Must be someone abusing it for a
>> > flag field post submission? Crazy.
>>
>> Let's hope that's not the case because there'll be blood if it is. :)
>
> Yeah, it's beyond the amount of crazy I've come to expect from various
> random users of IO interfaces :-)
I think it's more likely to be some use-after-free after a long timeout.
Wanlong says it happens a few minutes after boot, so maybe something
times out a command, does the blk_complete_request(), and free's the
bio, which gets re-used before the softirq actually ends up running.
I note that Wanlong uses the SLAB allocator, not the SLUB one. I
wonder if the thing goes away with SLUB, and if not, if
CONFIG_SLUB_DEBUG_ON=y might help debug it?
Linus
On Thu, Apr 18 2013, Linus Torvalds wrote:
> On Thu, Apr 18, 2013 at 11:13 AM, Jens Axboe <[email protected]> wrote:
> > On Thu, Apr 18 2013, Tejun Heo wrote:
> >> On Thu, Apr 18, 2013 at 10:39:00AM -0700, Jens Axboe wrote:
> >> >
> >> > Yep, thanks Linus for that hint... Must be someone abusing it for a
> >> > flag field post submission? Crazy.
> >>
> >> Let's hope that's not the case because there'll be blood if it is. :)
> >
> > Yeah, it's beyond the amount of crazy I've come to expect from various
> > random users of IO interfaces :-)
>
> I think it's more likely to be some use-after-free after a long timeout.
>
> Wanlong says it happens a few minutes after boot, so maybe something
> times out a command, does the blk_complete_request(), and free's the
> bio, which gets re-used before the softirq actually ends up running.
>
> I note that Wanlong uses the SLAB allocator, not the SLUB one. I
> wonder if the thing goes away with SLUB, and if not, if
> CONFIG_SLUB_DEBUG_ON=y might help debug it?
Hmm dunno. It happens right after we've completed the bio, which touches
a lot of fields too. bi_bdev sits between bi_next (which we definitely
used) and bi_flags.
But adding slab use-after-free debugging would show for sure.
--
Jens Axboe
Could it be a module not recompiled with a newer kernel?
-- Steve
Jens Axboe <[email protected]> wrote:
>
>Hmm dunno. It happens right after we've completed the bio, which
>touches
>a lot of fields too. bi_bdev sits between bi_next (which we definitely
>used) and bi_flags.
>
>But adding slab use-after-free debugging would show for sure.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On 04/18/2013 10:30 PM, Jens Axboe wrote:
> On Thu, Apr 18 2013, Wanlong Gao wrote:
>> On 04/18/2013 09:35 PM, Jens Axboe wrote:
>>> On Thu, Apr 18 2013, Wanlong Gao wrote:
>>>>> A bio is always fully initialized, regardless of which internal
>>>>> allocator it came from. If people are doing private kmallocs, then they
>>>>> better be using bio_init() as well.
>>>>>
>>>>> Wanlong, would it be possible to get a full dmesg on boot see I can see
>>>>> what drivers and file systems are in use? Anything special about your
>>>>> setup.
>>>>
>>>> Sure, attached.
>>>
>>> Does the below help?
>>
>> Still got panic with this patch. :-(
>
> Could you capture the full dump? I'd like to see what rcx was, and that
> seems to have scrolled off.
>
I got the panic message using netconsole:
EXT4-fs (sda5): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (sda8): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (sda9): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (sda9): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (sda10): mounted filesystem with ordered data mode. Opts: (null)
SGI XFS with ACLs, security attributes, large block/inode numbers, no debug enabled
XFS (sda11): Mounting Filesystem
XFS (sda11): Ending clean mount
kjournald starting. Commit interval 5 seconds
EXT3-fs (sda12): warning: maximal mount count reached, running e2fsck is recommended
EXT3-fs (sda12): using internal journal
EXT3-fs (sda12): mounted filesystem with ordered data mode
XFS (sda13): Mounting Filesystem
XFS (sda13): Ending clean mount
xor: measuring software checksum speed
prefetch64-sse: 7572.000 MB/sec
generic_sse: 6300.000 MB/sec
xor: using function: prefetch64-sse (7572.000 MB/sec)
raid6: sse2x1 4796 MB/s
raid6: sse2x2 5449 MB/s
raid6: sse2x4 5816 MB/s
raid6: using algorithm sse2x4 (5816 MB/s)
raid6: using ssse3x2 recovery algorithm
Btrfs loaded
device fsid 01548710-abce-4290-801d-5f19dde14497 devid 1 transid 131526 /dev/sda14
BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
IP: [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: btrfs raid6_pq zlib_deflate xor xfs libcrc32c netconsole ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp ppdev parport_pc parport fuse cpufreq_ondemand cachefiles fscache bridge stp llc ipv6 ext3 jbd dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support sg acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode serio_raw pcspkr i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core igb dca i2c_algo_bit i2c_core ptp pps_core dm_mod ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix
CPU 3
Pid: 0, comm: swapper/3 Not tainted 3.9.0-rc7-4-gbb33db7 #1 Lenovo Lenovo WQ TR280 G3/TR350 G7/TB36D
RIP: 0010:[<ffffffff812484d3>] [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
RSP: 0018:ffff8802bfc63bb8 EFLAGS: 00010202
RAX: ffff8802b1e88698 RBX: ffff8802b05f6bc0 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8802b1e8869c
RBP: ffff8802bfc63c08 R08: 000000000000000a R09: 0000000000000000
R10: 000000000000002c R11: 0000000000000000 R12: ffffffff81a95b60
R13: 0000000000000100 R14: ffff8802b1e88694 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8802bfc60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000001 CR3: 0000000001a0c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/3 (pid: 0, threadinfo ffff8802b4598000, task ffff8802b4594a80)
Stack:
ffff8802b4546c40 0000000000000286 0000000000000100 ffff8802b4546c40
ffff8802bfc63c08 ffff8802b1f82160 ffff8802b05f6bc0 0000000000000000
0000000000001000 0000000000000000 ffff8802bfc63c38 ffffffff811b6c10
Call Trace:
<IRQ>
[<ffffffff811b6c10>] bio_endio+0x80/0x90
[<ffffffffa0790d26>] btrfs_end_bio+0xf6/0x190 [btrfs]
[<ffffffff811b6bcd>] bio_endio+0x3d/0x90
[<ffffffff81249873>] req_bio_endio+0xa3/0xe0
[<ffffffff8124af9f>] blk_update_request+0x10f/0x480
[<ffffffff8124b337>] blk_update_bidi_request+0x27/0xb0
[<ffffffff8124c52f>] blk_end_bidi_request+0x2f/0x80
[<ffffffff8124c5d0>] blk_end_request+0x10/0x20
[<ffffffff8139c1b0>] scsi_end_request+0x40/0xb0
[<ffffffff8139c52f>] scsi_io_completion+0x9f/0x660
[<ffffffff81393899>] scsi_finish_command+0xc9/0x130
[<ffffffff8139cc57>] scsi_softirq_done+0x147/0x170
[<ffffffff81252c52>] blk_done_softirq+0x82/0xa0
[<ffffffff8105f537>] __do_softirq+0xe7/0x250
[<ffffffff8105f7a5>] irq_exit+0xb5/0xc0
[<ffffffff81570966>] do_IRQ+0x66/0xe0
[<ffffffff8156662d>] common_interrupt+0x6d/0x6d
<EOI>
[<ffffffff81449e30>] ? cpuidle_wrap_enter+0x50/0x90
[<ffffffff81449e29>] ? cpuidle_wrap_enter+0x49/0x90
[<ffffffff81449e80>] cpuidle_enter_tk+0x10/0x20
[<ffffffff81449887>] cpuidle_enter_state+0x17/0x50
[<ffffffff8144a0ca>] cpuidle_idle_call+0xaa/0x110
[<ffffffff8101cd0f>] cpu_idle+0xcf/0x120
[<ffffffff8155aa9a>] start_secondary+0xc8/0xca
Code: 8b 4d b8 ba 28 00 00 00 45 89 e8 e8 e8 2c eb ff 48 85 c0 49 89 c6 74 5a 48 89 c7 e8 08 c6 ea ff 48 8b 4b 10 31 d2 48 85 c9 74 02 <8b> 11 89 50 08 48 8b 13 48 8d 78 20 48 89 50 10 8b 53 30 44 89
RIP [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
RSP <ffff8802bfc63bb8>
CR2: 0000000000000001
---[ end trace b13662c9f1a097d8 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Thanks,
Wanlong Gao
On 04/19/2013 01:52 AM, Tejun Heo wrote:
> Hello, Wanlong.
>
> On Thu, Apr 18, 2013 at 10:45:10PM +0800, Wanlong Gao wrote:
>> OK, but I should capture it tomorrow morning because this remote machine has already panicked
>> and need hard reboot.
>
> Can you please apply the following patch when you try the next time
> and report the kernel log? It should prevent the oops from happening
> and dump information on at least who are the involved parties.
Here attached.
Thanks,
Wanlong Gao
>
> Thanks.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 074b758..d5efebe 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -150,6 +150,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> }
> EXPORT_SYMBOL(blk_rq_init);
>
> +
> +
> static void req_bio_endio(struct request *rq, struct bio *bio,
> unsigned int nbytes, int error)
> {
> @@ -174,8 +176,15 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
> bio_integrity_advance(bio, nbytes);
>
> /* don't actually finish bio if it's part of flush sequence */
> - if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
> + if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ)) {
> + if (bio->bi_bdev && (unsigned long)bio->bi_bdev < 4096) {
> + printk_ratelimited("XXX req_bio_endio: queue %s bio=%p bi_bdev=%p bi_end_io=%pf\n",
> + (rq->q && rq->q->backing_dev_info.dev) ?
> + dev_name(rq->q->backing_dev_info.dev) : "<unknown>",
> + bio, bio->bi_bdev, bio->bi_end_io);
> + }
> bio_endio(bio, error);
> + }
> }
>
> void blk_dump_rq_flags(struct request *rq, char *msg)
> diff --git a/fs/bio.c b/fs/bio.c
> index bb5768f..da64e5a 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1428,7 +1428,10 @@ void bio_endio(struct bio *bio, int error)
> else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
> error = -EIO;
>
> - trace_block_bio_complete(bio, error);
> + if (bio->bi_bdev && (unsigned long)bio->bi_bdev < 4096)
> + printk_ratelimited("XXX bio_endio: skipping TP for %p\n", bio);
> + else
> + trace_block_bio_complete(bio, error);
>
> if (bio->bi_end_io)
> bio->bi_end_io(bio, error);
>
(cc'ing btrfs people)
On Fri, Apr 19, 2013 at 11:33:20AM +0800, Wanlong Gao wrote:
> RIP: 0010:[<ffffffff812484d3>] [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
...
> [<ffffffff811b6c10>] bio_endio+0x80/0x90
> [<ffffffffa0790d26>] btrfs_end_bio+0xf6/0x190 [btrfs]
> [<ffffffff811b6bcd>] bio_endio+0x3d/0x90
> [<ffffffff81249873>] req_bio_endio+0xa3/0xe0
Ugh....
In fs/btrfs/volumes.c
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
{
...
bio->bi_bdev = (struct block_device *)
(unsigned long)bbio->mirror_num;
...
}
static void btrfs_end_bio(struct bio *bio, int err)
{
...
bio->bi_bdev = (struct block_device *)
(unsigned long)bbio->mirror_num;
...
}
In fs/btrfs/extent_io.c
static void end_bio_extent_readpage(struct bio *bio, int err)
{
int mirror;
...
mirror = (int)(unsigned long)bio->bi_bdev;
...
}
Ewweeeeeeeeeeeeeeeeeehh........
No wonder this thing crashes. Chris, can't the original bio carry
bbio in bi_private and let end_bio_extent_readpage() free the bbio
instead of abusing bi_bdev like this?
--
tejun
On 04/19/2013 03:10 AM, Linus Torvalds wrote:
> On Thu, Apr 18, 2013 at 11:13 AM, Jens Axboe <[email protected]> wrote:
>> On Thu, Apr 18 2013, Tejun Heo wrote:
>>> On Thu, Apr 18, 2013 at 10:39:00AM -0700, Jens Axboe wrote:
>>>>
>>>> Yep, thanks Linus for that hint... Must be someone abusing it for a
>>>> flag field post submission? Crazy.
>>>
>>> Let's hope that's not the case because there'll be blood if it is. :)
>>
>> Yeah, it's beyond the amount of crazy I've come to expect from various
>> random users of IO interfaces :-)
>
> I think it's more likely to be some use-after-free after a long timeout.
>
> Wanlong says it happens a few minutes after boot, so maybe something
> times out a command, does the blk_complete_request(), and free's the
> bio, which gets re-used before the softirq actually ends up running.
>
> I note that Wanlong uses the SLAB allocator, not the SLUB one. I
> wonder if the thing goes away with SLUB, and if not, if
> CONFIG_SLUB_DEBUG_ON=y might help debug it?
Done as you mentioned, attach the config and dmesg before panic.
And the panic message almost the same as before here: https://lkml.org/lkml/2013/4/18/633
Thanks,
Wanlong Gao
>
> Linus
>
On Thu, Apr 18, 2013 at 10:57:54PM -0700, Tejun Heo wrote:
> No wonder this thing crashes. Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?
BTW, I think it's a bit too late to fix this properly from btrfs side
unless we're gonna do -rc8, so let's revert the TP patch for now and
sort this out in the next devel cycle. AFAICS, while disturbingly
(ha|yu)cky, the bi_bdev trick should be okay without the new TP.
Thanks.
--
tejun
On 04/19/2013 02:17 PM, Tejun Heo wrote:
> On Thu, Apr 18, 2013 at 10:57:54PM -0700, Tejun Heo wrote:
>> No wonder this thing crashes. Chris, can't the original bio carry
>> bbio in bi_private and let end_bio_extent_readpage() free the bbio
>> instead of abusing bi_bdev like this?
>
> BTW, I think it's a bit too late to fix this properly from btrfs side
> unless we're gonna do -rc8, so let's revert the TP patch for now and
> sort this out in the next devel cycle. AFAICS, while disturbingly
> (ha|yu)cky, the bi_bdev trick should be okay without the new TP.
Thank you for investigating this. And sorry for my incompleted panic picture
before that let you detour. :-(
Regards,
Wanlong Gao
>
> Thanks.
>
On Fri, April 19, 2013 at 07:57 (+0200), Tejun Heo wrote:
> (cc'ing btrfs people)
>
> On Fri, Apr 19, 2013 at 11:33:20AM +0800, Wanlong Gao wrote:
>> RIP: 0010:[<ffffffff812484d3>] [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
> ...
>> [<ffffffff811b6c10>] bio_endio+0x80/0x90
>> [<ffffffffa0790d26>] btrfs_end_bio+0xf6/0x190 [btrfs]
>> [<ffffffff811b6bcd>] bio_endio+0x3d/0x90
>> [<ffffffff81249873>] req_bio_endio+0xa3/0xe0
>
> Ugh....
>
> In fs/btrfs/volumes.c
>
> static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
> {
> ...
> bio->bi_bdev = (struct block_device *)
> (unsigned long)bbio->mirror_num;
> ...
> }
>
> static void btrfs_end_bio(struct bio *bio, int err)
> {
> ...
> bio->bi_bdev = (struct block_device *)
> (unsigned long)bbio->mirror_num;
>
> ...
> }
>
> In fs/btrfs/extent_io.c
>
> static void end_bio_extent_readpage(struct bio *bio, int err)
> {
> int mirror;
> ...
> mirror = (int)(unsigned long)bio->bi_bdev;
> ...
> }
>
> Ewweeeeeeeeeeeeeeeeeehh........
>
> No wonder this thing crashes. Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?
Oops.
It's been my patch back in 2011 (commit 2774b2ca3), sent as an RFC-Patch and
just slipped in without further discussion of that exact change. Hackish, yes -
my reasoning was because the block layer changed bio->bi_bdev anyway, no one
would want to look into it after the bio returned (and in fact it didn't hurt
for like two years now). Although the block layer changes bi_bdev, it stays a
valid bdev pointer, I admit.
One way around this would be what you suggest, however that would mean the
caller of (btrfs|btree)_submit_bio_hook gets its completion called in the end,
but must know that the private is in fact a bbio which in turn carries the
caller's private. Doesn't sound clean to me, either.
The best idea I currently have is to add a dispatcher function that does the
freeing of bbio and calls the actual completion with mirror_num as a separate
parameter. That would make all the btrfs completions incompatible with
bio_end_io_t, but it shouldn't hurt.
At least now I know I wasn't invited to LSF for a good reason :-)
-Jan
Quoting Tejun Heo (2013-04-19 01:57:54)
>
> Ewweeeeeeeeeeeeeeeeeehh........
>
> No wonder this thing crashes. Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?
Yes, we can definitely carry bbio up higher in the stack. I'll patch it
up right now. I do agree that it'll be too big for -final, but we'll
have it either way.
-chris
On Thu, Apr 18 2013, Tejun Heo wrote:
> On Thu, Apr 18, 2013 at 10:57:54PM -0700, Tejun Heo wrote:
> > No wonder this thing crashes. Chris, can't the original bio carry
> > bbio in bi_private and let end_bio_extent_readpage() free the bbio
> > instead of abusing bi_bdev like this?
>
> BTW, I think it's a bit too late to fix this properly from btrfs side
> unless we're gonna do -rc8, so let's revert the TP patch for now and
> sort this out in the next devel cycle. AFAICS, while disturbingly
> (ha|yu)cky, the bi_bdev trick should be okay without the new TP.
The tracing commit was already backed out yesterday, so no problems
there. I figured it'd be a bit too late to fix something this nasty in a
suitably clean and straight forward fashion.
--
Jens Axboe
On Thu, Apr 18 2013, Tejun Heo wrote:
> (cc'ing btrfs people)
>
> On Fri, Apr 19, 2013 at 11:33:20AM +0800, Wanlong Gao wrote:
> > RIP: 0010:[<ffffffff812484d3>] [<ffffffff812484d3>] ftrace_raw_event_block_bio_complete+0x73/0xf0
> ...
> > [<ffffffff811b6c10>] bio_endio+0x80/0x90
> > [<ffffffffa0790d26>] btrfs_end_bio+0xf6/0x190 [btrfs]
> > [<ffffffff811b6bcd>] bio_endio+0x3d/0x90
> > [<ffffffff81249873>] req_bio_endio+0xa3/0xe0
>
> Ugh....
>
> In fs/btrfs/volumes.c
>
> static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
> {
> ...
> bio->bi_bdev = (struct block_device *)
> (unsigned long)bbio->mirror_num;
> ...
> }
>
> static void btrfs_end_bio(struct bio *bio, int err)
> {
> ...
> bio->bi_bdev = (struct block_device *)
> (unsigned long)bbio->mirror_num;
>
> ...
> }
>
> In fs/btrfs/extent_io.c
>
> static void end_bio_extent_readpage(struct bio *bio, int err)
> {
> int mirror;
> ...
> mirror = (int)(unsigned long)bio->bi_bdev;
> ...
> }
>
> Ewweeeeeeeeeeeeeeeeeehh........
>
> No wonder this thing crashes. Chris, can't the original bio carry
> bbio in bi_private and let end_bio_extent_readpage() free the bbio
> instead of abusing bi_bdev like this?
Ugh, wtf.
Chris, time for a swim in the bay :-)
--
Jens Axboe
Quoting Jens Axboe (2013-04-19 09:32:50)
> >
> > No wonder this thing crashes. Chris, can't the original bio carry
> > bbio in bi_private and let end_bio_extent_readpage() free the bbio
> > instead of abusing bi_bdev like this?
>
> Ugh, wtf.
>
> Chris, time for a swim in the bay :-)
Yeah, I can't really defend this one. We needed a space for an int and
I assumed end_io meant the FS was free to do horrible things.
Really though, I'll just take a quick dip in the lake and patch this out
of btrfs.
Jan is probably right about changing around our endio callbacks to
explicitly pass the mirror, it should be less complex and cleaner.
Many thanks to everyone here that tracked it down.
-chris