2007-08-30 18:15:41

by Avantika Mathur

[permalink] [raw]
Subject: jbd2_journal_commit_transaction oops

Hi Girish,

When running fsstress on an x86 machine, I hit the kernel oops below. The ext4-patch-queue is being tested including the journal checksum patches
I have mounted with -o delalloc,mballoc,data=writeback,journal_checksum

I did not get a chance to look in detail; but it looks like a journal issue. Do you know what the issue might be?


BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c118ba5d
*pdpt = 000000002560a001
*pde = 0000000000000000
Oops: 0000 [#1]
SMP
Modules linked in:
CPU: 1
EIP: 0060:[<c118ba5d>] Not tainted VLI
EFLAGS: 00010257 (2.6.23-rc4-autokern1 #1)
EIP is at crc32_be+0x3d/0x9c
eax: 7e78a276 ebx: 76a2787e ecx: 00000400 edx: 00000000
esi: 00000000 edi: 00000000 ebp: f56e5200 esp: e61f9e90
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
Process kjournald2 (pid: 5388, ti=e61f8000 task=e3efc000 task.ti=e61f8000)
Stack: ef5fffc0 00000016 c10c762d 0000055e 00000000 00000000 00001000 00000000
f50e3e80 7e78a276 00000008 00000000 00000544 eb46aab4 eb46aabc 00000155
f585f800 e1e1c968 00000000 eb059428 0000055e 00000000 00000000 e3efc000
Call Trace:
[<c10c762d>] jbd2_journal_commit_transaction+0x92a/0x128d
[<c1029e51>] autoremove_wake_function+0x0/0x33
[<c1029e51>] autoremove_wake_function+0x0/0x33
[<c10214f4>] try_to_del_timer_sync+0x42/0x48
[<c10ca4fd>] kjournald2+0x130/0x307
[<c1029e51>] autoremove_wake_function+0x0/0x33
[<c129171c>] __sched_text_start+0x364/0x3ff
[<c1029e51>] autoremove_wake_function+0x0/0x33
[<c10ca3cd>] kjournald2+0x0/0x307
[<c1029a27>] kthread+0x34/0x55
[<c10299f3>] kthread+0x0/0x55
[<c1003173>] kernel_thread_helper+0x7/0x10
=======================
Code: 42 30 d8 0f b6 c0 c1 eb 08 33 1c 85 e0 ae 2a c1 49 74 05 f6 c2 03 75 e5 83 f9 03 76 4c 89 ce 83 ea 04 83 e6 03 c1 e9 02 83 c2 04 <33> 1a 0f b6 c3 c1 eb 08 33 1c 85 e0 ae 2a c1 0f b6 c3 c1 eb 08
EIP: [<c118ba5d>] crc32_be+0x3d/0x9c SS:ESP 0068:e61f9e90
-- 0:conmux-control -- time-stamp -- Aug/30/07 0:36:46 --
-- 0:conmux-control -- time-stamp -- Aug/30/07 5:56:32 --
(bot:conmon-payload) disconnected


thanks,
Avantika


2007-08-31 09:19:01

by Girish Shilamkar

[permalink] [raw]
Subject: Re: jbd2_journal_commit_transaction oops

Hi Avantika,
From initial code review I think oops in crc32_be is caused as
bh->b_data passed to the function is NULL. Most probably something might
have gone wrong in jbd2_journal_write_metadata_buffer() making bh->data
= 0.
Does the error go away, when run without journal_checksum patch/option ?

Regards,
Girish.

On Thu, 2007-08-30 at 11:16 -0700, Avantika Mathur wrote:
> Hi Girish,
>
> When running fsstress on an x86 machine, I hit the kernel oops below. The ext4-patch-queue is being tested including the journal checksum patches
> I have mounted with -o delalloc,mballoc,data=writeback,journal_checksum
>
> I did not get a chance to look in detail; but it looks like a journal issue. Do you know what the issue might be?
>
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c118ba5d
> *pdpt = 000000002560a001
> *pde = 0000000000000000
> Oops: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 1
> EIP: 0060:[<c118ba5d>] Not tainted VLI
> EFLAGS: 00010257 (2.6.23-rc4-autokern1 #1)
> EIP is at crc32_be+0x3d/0x9c
> eax: 7e78a276 ebx: 76a2787e ecx: 00000400 edx: 00000000
> esi: 00000000 edi: 00000000 ebp: f56e5200 esp: e61f9e90
> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
> Process kjournald2 (pid: 5388, ti=e61f8000 task=e3efc000 task.ti=e61f8000)
> Stack: ef5fffc0 00000016 c10c762d 0000055e 00000000 00000000 00001000 00000000
> f50e3e80 7e78a276 00000008 00000000 00000544 eb46aab4 eb46aabc 00000155
> f585f800 e1e1c968 00000000 eb059428 0000055e 00000000 00000000 e3efc000
> Call Trace:
> [<c10c762d>] jbd2_journal_commit_transaction+0x92a/0x128d
> [<c1029e51>] autoremove_wake_function+0x0/0x33
> [<c1029e51>] autoremove_wake_function+0x0/0x33
> [<c10214f4>] try_to_del_timer_sync+0x42/0x48
> [<c10ca4fd>] kjournald2+0x130/0x307
> [<c1029e51>] autoremove_wake_function+0x0/0x33
> [<c129171c>] __sched_text_start+0x364/0x3ff
> [<c1029e51>] autoremove_wake_function+0x0/0x33
> [<c10ca3cd>] kjournald2+0x0/0x307
> [<c1029a27>] kthread+0x34/0x55
> [<c10299f3>] kthread+0x0/0x55
> [<c1003173>] kernel_thread_helper+0x7/0x10
> =======================
> Code: 42 30 d8 0f b6 c0 c1 eb 08 33 1c 85 e0 ae 2a c1 49 74 05 f6 c2 03 75 e5 83 f9 03 76 4c 89 ce 83 ea 04 83 e6 03 c1 e9 02 83 c2 04 <33> 1a 0f b6 c3 c1 eb 08 33 1c 85 e0 ae 2a c1 0f b6 c3 c1 eb 08
> EIP: [<c118ba5d>] crc32_be+0x3d/0x9c SS:ESP 0068:e61f9e90
> -- 0:conmux-control -- time-stamp -- Aug/30/07 0:36:46 --
> -- 0:conmux-control -- time-stamp -- Aug/30/07 5:56:32 --
> (bot:conmon-payload) disconnected
>
>
> thanks,
> Avantika

2007-08-31 21:32:05

by Avantika Mathur

[permalink] [raw]
Subject: Re: jbd2_journal_commit_transaction oops

Girish Shilamkar wrote:
> Hi Avantika,
> From initial code review I think oops in crc32_be is caused as
> bh->b_data passed to the function is NULL. Most probably something might
> have gone wrong in jbd2_journal_write_metadata_buffer() making bh->data
> = 0.
> Does the error go away, when run without journal_checksum patch/option ?
>
>
>
I tried running fsstress without the journal_checksum option set and
didn't see any errors. But I got the oops again when I mounted with
journal_checksum.
thank you for looking into this.

Avantika
> On Thu, 2007-08-30 at 11:16 -0700, Avantika Mathur wrote:
>
>> Hi Girish,
>>
>> When running fsstress on an x86 machine, I hit the kernel oops below. The ext4-patch-queue is being tested including the journal checksum patches
>> I have mounted with -o delalloc,mballoc,data=writeback,journal_checksum
>>
>> I did not get a chance to look in detail; but it looks like a journal issue. Do you know what the issue might be?
>>
>>
>> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
>> printing eip:
>> c118ba5d
>> *pdpt = 000000002560a001
>> *pde = 0000000000000000
>> Oops: 0000 [#1]
>> SMP
>> Modules linked in:
>> CPU: 1
>> EIP: 0060:[<c118ba5d>] Not tainted VLI
>> EFLAGS: 00010257 (2.6.23-rc4-autokern1 #1)
>> EIP is at crc32_be+0x3d/0x9c
>> eax: 7e78a276 ebx: 76a2787e ecx: 00000400 edx: 00000000
>> esi: 00000000 edi: 00000000 ebp: f56e5200 esp: e61f9e90
>> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
>> Process kjournald2 (pid: 5388, ti=e61f8000 task=e3efc000 task.ti=e61f8000)
>> Stack: ef5fffc0 00000016 c10c762d 0000055e 00000000 00000000 00001000 00000000
>> f50e3e80 7e78a276 00000008 00000000 00000544 eb46aab4 eb46aabc 00000155
>> f585f800 e1e1c968 00000000 eb059428 0000055e 00000000 00000000 e3efc000
>> Call Trace:
>> [<c10c762d>] jbd2_journal_commit_transaction+0x92a/0x128d
>> [<c1029e51>] autoremove_wake_function+0x0/0x33
>> [<c1029e51>] autoremove_wake_function+0x0/0x33
>> [<c10214f4>] try_to_del_timer_sync+0x42/0x48
>> [<c10ca4fd>] kjournald2+0x130/0x307
>> [<c1029e51>] autoremove_wake_function+0x0/0x33
>> [<c129171c>] __sched_text_start+0x364/0x3ff
>> [<c1029e51>] autoremove_wake_function+0x0/0x33
>> [<c10ca3cd>] kjournald2+0x0/0x307
>> [<c1029a27>] kthread+0x34/0x55
>> [<c10299f3>] kthread+0x0/0x55
>> [<c1003173>] kernel_thread_helper+0x7/0x10
>> =======================
>> Code: 42 30 d8 0f b6 c0 c1 eb 08 33 1c 85 e0 ae 2a c1 49 74 05 f6 c2 03 75 e5 83 f9 03 76 4c 89 ce 83 ea 04 83 e6 03 c1 e9 02 83 c2 04 <33> 1a 0f b6 c3 c1 eb 08 33 1c 85 e0 ae 2a c1 0f b6 c3 c1 eb 08
>> EIP: [<c118ba5d>] crc32_be+0x3d/0x9c SS:ESP 0068:e61f9e90
>> -- 0:conmux-control -- time-stamp -- Aug/30/07 0:36:46 --
>> -- 0:conmux-control -- time-stamp -- Aug/30/07 5:56:32 --
>> (bot:conmon-payload) disconnected
>>
>>
>> thanks,
>> Avantika
>>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2007-09-13 09:49:48

by Girish Shilamkar

[permalink] [raw]
Subject: Re: jbd2_journal_commit_transaction oops

Hi,
Avantika and I had been working on this and we found
that this problem is only seen on numaq machines. Avantika plans to
run few more tests on numaq machines to gather more information.

Regards,
Girish.

On 9/1/07, Avantika Mathur <[email protected]> wrote:
> Girish Shilamkar wrote:
> > Hi Avantika,
> > From initial code review I think oops in crc32_be is caused as
> > bh->b_data passed to the function is NULL. Most probably something might
> > have gone wrong in jbd2_journal_write_metadata_buffer() making bh->data
> > = 0.
> > Does the error go away, when run without journal_checksum patch/option ?
> >
> >
> >
> I tried running fsstress without the journal_checksum option set and
> didn't see any errors. But I got the oops again when I mounted with
> journal_checksum.
> thank you for looking into this.
>
> Avantika
> > On Thu, 2007-08-30 at 11:16 -0700, Avantika Mathur wrote:
> >
> >> Hi Girish,
> >>
> >> When running fsstress on an x86 machine, I hit the kernel oops below. The ext4-patch-queue is being tested including the journal checksum patches
> >> I have mounted with -o delalloc,mballoc,data=writeback,journal_checksum
> >>
> >> I did not get a chance to look in detail; but it looks like a journal issue. Do you know what the issue might be?
> >>
> >>
> >> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
> >> printing eip:
> >> c118ba5d
> >> *pdpt = 000000002560a001
> >> *pde = 0000000000000000
> >> Oops: 0000 [#1]
> >> SMP
> >> Modules linked in:
> >> CPU: 1
> >> EIP: 0060:[<c118ba5d>] Not tainted VLI
> >> EFLAGS: 00010257 (2.6.23-rc4-autokern1 #1)
> >> EIP is at crc32_be+0x3d/0x9c
> >> eax: 7e78a276 ebx: 76a2787e ecx: 00000400 edx: 00000000
> >> esi: 00000000 edi: 00000000 ebp: f56e5200 esp: e61f9e90
> >> ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
> >> Process kjournald2 (pid: 5388, ti=e61f8000 task=e3efc000 task.ti=e61f8000)
> >> Stack: ef5fffc0 00000016 c10c762d 0000055e 00000000 00000000 00001000 00000000
> >> f50e3e80 7e78a276 00000008 00000000 00000544 eb46aab4 eb46aabc 00000155
> >> f585f800 e1e1c968 00000000 eb059428 0000055e 00000000 00000000 e3efc000
> >> Call Trace:
> >> [<c10c762d>] jbd2_journal_commit_transaction+0x92a/0x128d
> >> [<c1029e51>] autoremove_wake_function+0x0/0x33
> >> [<c1029e51>] autoremove_wake_function+0x0/0x33
> >> [<c10214f4>] try_to_del_timer_sync+0x42/0x48
> >> [<c10ca4fd>] kjournald2+0x130/0x307
> >> [<c1029e51>] autoremove_wake_function+0x0/0x33
> >> [<c129171c>] __sched_text_start+0x364/0x3ff
> >> [<c1029e51>] autoremove_wake_function+0x0/0x33
> >> [<c10ca3cd>] kjournald2+0x0/0x307
> >> [<c1029a27>] kthread+0x34/0x55
> >> [<c10299f3>] kthread+0x0/0x55
> >> [<c1003173>] kernel_thread_helper+0x7/0x10
> >> =======================
> >> Code: 42 30 d8 0f b6 c0 c1 eb 08 33 1c 85 e0 ae 2a c1 49 74 05 f6 c2 03 75 e5 83 f9 03 76 4c 89 ce 83 ea 04 83 e6 03 c1 e9 02 83 c2 04 <33> 1a 0f b6 c3 c1 eb 08 33 1c 85 e0 ae 2a c1 0f b6 c3 c1 eb 08
> >> EIP: [<c118ba5d>] crc32_be+0x3d/0x9c SS:ESP 0068:e61f9e90
> >> -- 0:conmux-control -- time-stamp -- Aug/30/07 0:36:46 --
> >> -- 0:conmux-control -- time-stamp -- Aug/30/07 5:56:32 --
> >> (bot:conmon-payload) disconnected
> >>
> >>
> >> thanks,
> >> Avantika
> >>
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
>

2007-11-02 00:40:39

by Mingming Cao

[permalink] [raw]
Subject: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

JBD2: Fix NULL pointer bh->b_data on NUMA box with journal checksumming.

Current journal checksumming patch failed fsstress test on NUMA. The
bh->b_data passed to the crc32_be () function could be NULL pointer,
which caused kernel oops immediately when running fsstress with -o
journal_checksum. It is because the page is part of highmem on NUMA box.
We need to kmap the page before access the bh->b_data to calculate
the checksums.

Signed-off-by: Mingming Cao <[email protected]>

---
fs/jbd2/commit.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6.24-rc1/fs/jbd2/commit.c
===================================================================
--- linux-2.6.24-rc1.orig/fs/jbd2/commit.c 2007-11-01 11:15:08.000000000 -0700
+++ linux-2.6.24-rc1/fs/jbd2/commit.c 2007-11-01 12:27:02.000000000 -0700
@@ -352,6 +352,20 @@ write_out_data:
journal_do_submit_data(wbuf, bufs);
}

+static inline __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+{
+ struct page *page = bh->b_page;
+ char *addr;
+ __u32 checksum;
+
+ addr = kmap(page);
+ checksum = crc32_be(crc32_sum,
+ (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+ kunmap(page);
+
+ return checksum;
+}
+
static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
unsigned long long block)
{
@@ -715,9 +729,8 @@ start_journal_io:
*/
if (JBD2_HAS_COMPAT_FEATURE(journal,
JBD2_FEATURE_COMPAT_CHECKSUM)) {
- crc32_sum = crc32_be(crc32_sum,
- (void *)bh->b_data,
- bh->b_size);
+ crc32_sum =
+ jbd2_checksum_data(crc32_sum, bh);
}

lock_buffer(bh);

2007-11-02 05:21:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Nov 01, 2007 17:40 -0700, Mingming Cao wrote:
> Current journal checksumming patch failed fsstress test on NUMA. The
> bh->b_data passed to the crc32_be () function could be NULL pointer,
> which caused kernel oops immediately when running fsstress with -o
> journal_checksum. It is because the page is part of highmem on NUMA box.
> We need to kmap the page before access the bh->b_data to calculate
> the checksums.

I have no objection to the patch, per-se, but I'm surprised that there
would ever be a buffer head pointing at a page in high memory? That
seems contrary to what I would expect...

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-11-02 15:29:41

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
> On Nov 01, 2007 17:40 -0700, Mingming Cao wrote:
> > Current journal checksumming patch failed fsstress test on NUMA. The
> > bh->b_data passed to the crc32_be () function could be NULL pointer,
> > which caused kernel oops immediately when running fsstress with -o
> > journal_checksum. It is because the page is part of highmem on NUMA box.
> > We need to kmap the page before access the bh->b_data to calculate
> > the checksums.
>
> I have no objection to the patch, per-se, but I'm surprised that there
> would ever be a buffer head pointing at a page in high memory? That
> seems contrary to what I would expect...

I was surprised to see that too while helping Mingming/Avantika track
this issue. I was under impression that we are checksumming only
metadata and it should be lowmem. But only "buffer_head"s are in lowmem.
Pages that point to can be in Highmem.

Thanks,
Badari

2007-11-03 01:37:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Nov 02, 2007 08:31 -0800, Badari Pulavarty wrote:
> On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
> > On Nov 01, 2007 17:40 -0700, Mingming Cao wrote:
> > > Current journal checksumming patch failed fsstress test on NUMA. The
> > > bh->b_data passed to the crc32_be () function could be NULL pointer,
> > > which caused kernel oops immediately when running fsstress with -o
> > > journal_checksum. It is because the page is part of highmem on NUMA box.
> > > We need to kmap the page before access the bh->b_data to calculate
> > > the checksums.
> >
> > I have no objection to the patch, per-se, but I'm surprised that there
> > would ever be a buffer head pointing at a page in high memory? That
> > seems contrary to what I would expect...
>
> I was surprised to see that too while helping Mingming/Avantika track
> this issue. I was under impression that we are checksumming only
> metadata and it should be lowmem. But only "buffer_head"s are in lowmem.
> Pages that point to can be in Highmem.

But... this implies that every user of bh->b_data needs to kmap, and I
don't see that in the code anywhere else. That makes me think something
else is going wrong here.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-11-05 16:03:37

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> On Nov 02, 2007 08:31 -0800, Badari Pulavarty wrote:
> > On Fri, 2007-11-02 at 13:20 +0800, Andreas Dilger wrote:
> > > On Nov 01, 2007 17:40 -0700, Mingming Cao wrote:
> > > > Current journal checksumming patch failed fsstress test on NUMA. The
> > > > bh->b_data passed to the crc32_be () function could be NULL pointer,
> > > > which caused kernel oops immediately when running fsstress with -o
> > > > journal_checksum. It is because the page is part of highmem on NUMA box.
> > > > We need to kmap the page before access the bh->b_data to calculate
> > > > the checksums.
> > >
> > > I have no objection to the patch, per-se, but I'm surprised that there
> > > would ever be a buffer head pointing at a page in high memory? That
> > > seems contrary to what I would expect...
> >
> > I was surprised to see that too while helping Mingming/Avantika track
> > this issue. I was under impression that we are checksumming only
> > metadata and it should be lowmem. But only "buffer_head"s are in lowmem.
> > Pages that point to can be in Highmem.
>
> But... this implies that every user of bh->b_data needs to kmap, and I
> don't see that in the code anywhere else. That makes me think something
> else is going wrong here.

Most cases, this is handled in ll_rw_block() code - when we submit the
buffer head for IO. If the page is in highmem, we will end up creating
a bounce bufer for it.

In our case, JBD code is trying to look at the data to do checksum
on it. Thats why we have to kmap() the page before looking.

Thanks,
Badari

2007-11-05 16:15:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Nov 05, 2007 08:04 -0800, Badari Pulavarty wrote:
> On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> > But... this implies that every user of bh->b_data needs to kmap, and I
> > don't see that in the code anywhere else. That makes me think something
> > else is going wrong here.
>
> Most cases, this is handled in ll_rw_block() code - when we submit the
> buffer head for IO. If the page is in highmem, we will end up creating
> a bounce bufer for it.
>
> In our case, JBD code is trying to look at the data to do checksum
> on it. Thats why we have to kmap() the page before looking.

My point is that there is a LOT of code in ext[234] that dereferences
bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock,
inode tables, etc). Does that imply that something is forcing those
bh pages into lowmem, or is the journal bh page in question being
allocated in some different way that allows it to be in highmem?

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2007-11-05 18:06:35

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote:
> On Nov 05, 2007 08:04 -0800, Badari Pulavarty wrote:
> > On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> > > But... this implies that every user of bh->b_data needs to kmap, and I
> > > don't see that in the code anywhere else. That makes me think something
> > > else is going wrong here.
> >
> > Most cases, this is handled in ll_rw_block() code - when we submit the
> > buffer head for IO. If the page is in highmem, we will end up creating
> > a bounce bufer for it.
> >
> > In our case, JBD code is trying to look at the data to do checksum
> > on it. Thats why we have to kmap() the page before looking.
>
> My point is that there is a LOT of code in ext[234] that dereferences
> bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock,
> inode tables, etc). Does that imply that something is forcing those
> bh pages into lowmem, or is the journal bh page in question being
> allocated in some different way that allows it to be in highmem?

Yes. You are right. Its been a while since I had to deal with HIGHMEM.
All the meta-data should be in LOWMEM. I asked Mingming to verify
what the buffer-head is pointing to when it has HIGHMEM page.

Thanks,
Badari

2007-11-05 23:21:38

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Mon, 2007-11-05 at 10:07 -0800, Badari Pulavarty wrote:
> On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote:
> > On Nov 05, 2007 08:04 -0800, Badari Pulavarty wrote:
> > > On Sat, 2007-11-03 at 09:36 +0800, Andreas Dilger wrote:
> > > > But... this implies that every user of bh->b_data needs to kmap, and I
> > > > don't see that in the code anywhere else. That makes me think something
> > > > else is going wrong here.
> > >
> > > Most cases, this is handled in ll_rw_block() code - when we submit the
> > > buffer head for IO. If the page is in highmem, we will end up creating
> > > a bounce bufer for it.
> > >
> > > In our case, JBD code is trying to look at the data to do checksum
> > > on it. Thats why we have to kmap() the page before looking.
> >
> > My point is that there is a LOT of code in ext[234] that dereferences
> > bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock,
> > inode tables, etc). Does that imply that something is forcing those
> > bh pages into lowmem, or is the journal bh page in question being
> > allocated in some different way that allows it to be in highmem?
>
> Yes. You are right. Its been a while since I had to deal with HIGHMEM.
> All the meta-data should be in LOWMEM. I asked Mingming to verify
> what the buffer-head is pointing to when it has HIGHMEM page.
>

The buffer_heads with NULL bh->b_data(under the "start_journal_io"
branch in jbd2_journal_commit_transaction() code) is created by
jbd2_journal_write_metadata_buffer().

Noticed that in jbd2_journal_write_metadata_buffer(), there are
multiple places which do kmap_atomic() to access the journal bh page
(new_page). In the normal case the new_page is pointing to the bh
pages, which(the page) was initially allocated by _page_cache_alloc()
(sb_bread->__bread()->_...>find_or_create_page()->_page_cache_alloc()

In the case it need a data copy (the buffer start with the
JBD2_MAGIC_NUMBER?), a new page is allocated by by
__get_free_pages()(via jbd2_alloc, which is possible allocated in
highmem. __get_free_pages calls alloc_pages() directly, doesn't seem to
have highmem handling like __page_cache_alloc().

I am not sure why we saw this issue on 2.6.23 kernel, where
jbd2_slab_alloc()->kmem_cache_alloc() is used. Isn't all slab pages
under lowmem?


Regards,

Mingming

2007-11-06 01:33:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC]JBD2: Fix journal checksum kernel oops on NUMA

On Nov 05, 2007 15:21 -0800, Mingming Cao wrote:
> On Mon, 2007-11-05 at 10:07 -0800, Badari Pulavarty wrote:
> > On Tue, 2007-11-06 at 00:15 +0800, Andreas Dilger wrote:
> > > My point is that there is a LOT of code in ext[234] that dereferences
> > > bh->b_data without kmap() (e.g. group descriptors, bitmaps, superblock,
> > > inode tables, etc). Does that imply that something is forcing those
> > > bh pages into lowmem, or is the journal bh page in question being
> > > allocated in some different way that allows it to be in highmem?
> >
> > Yes. You are right. Its been a while since I had to deal with HIGHMEM.
> > All the meta-data should be in LOWMEM. I asked Mingming to verify
> > what the buffer-head is pointing to when it has HIGHMEM page.
> >
>
> The buffer_heads with NULL bh->b_data(under the "start_journal_io"
> branch in jbd2_journal_commit_transaction() code) is created by
> jbd2_journal_write_metadata_buffer().
>
> Noticed that in jbd2_journal_write_metadata_buffer(), there are
> multiple places which do kmap_atomic() to access the journal bh page
> (new_page). In the normal case the new_page is pointing to the bh
> pages, which(the page) was initially allocated by _page_cache_alloc()
> (sb_bread->__bread()->_...>find_or_create_page()->_page_cache_alloc()
>
> In the case it need a data copy (the buffer start with the
> JBD2_MAGIC_NUMBER?), a new page is allocated by by
> __get_free_pages()(via jbd2_alloc, which is possible allocated in
> highmem. __get_free_pages calls alloc_pages() directly, doesn't seem to
> have highmem handling like __page_cache_alloc().

So long as there is a good explanation, and the code in jbd is expecting
to kmap() the b_data pages always then I have no objection to the patch.
I was just worried there was some other kind of bug involved here and
wanted to ensure that the root cause was understood. It might be prudent
to grep for b_data in the jbd2 code to verify there are no other places
that dereference the bh page without kmap first.

Thanks for the investigation Mingming.

Girish, can you please include this fix into our patch series.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.