2007-01-20 12:23:43

by Justin Piszcz

Subject: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

My .config is attached, please let me know if any other information is
needed and please CC (lkml) as I am not on the list, thanks!

Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
the RAID5 running XFS.

Any idea what happened here?

[473795.214705] BUG: unable to handle kernel paging request at virtual address fffb92b0
[473795.214715] printing eip:
[473795.214718] c0358b14
[473795.214721] *pde = 00003067
[473795.214723] *pte = 00000000
[473795.214726] Oops: 0000 [#1]
[473795.214729] PREEMPT SMP
[473795.214736] CPU: 0
[473795.214737] EIP: 0060:[<c0358b14>] Not tainted VLI
[473795.214738] EFLAGS: 00010286 (2.6.19.2 #1)
[473795.214746] EIP is at copy_data+0x6c/0x179
[473795.214750] eax: 00000000 ebx: 00001000 ecx: 00000354 edx: fffb9000
[473795.214754] esi: fffb92b0 edi: da86c2b0 ebp: 00001000 esp: f7927dc4
[473795.214757] ds: 007b es: 007b ss: 0068
[473795.214761] Process md4_raid5 (pid: 1305, ti=f7926000 task=f7ea9030 task.ti=f7926000)
[473795.214765] Stack: c1ba7c40 00000003 f5538c80 00000001 da86c000 00000009 00000000 0000006c
[473795.214790] 00001000 da8536a8 aa6fee90 f5538c80 00000190 c0358d00 aa6fee88 0000ffff
[473795.214863] d7c5794c 00000001 da853488 f6fbec70 f6fbebc0 00000001 00000005 00000001
[473795.214876] Call Trace:
[473795.214880] [<c0358d00>] compute_parity5+0xdf/0x497
[473795.214887] [<c035b0dd>] handle_stripe+0x930/0x2986
[473795.214892] [<c01146b9>] find_busiest_group+0x124/0x4fd
[473795.214898] [<c03580e0>] release_stripe+0x21/0x2e
[473795.214902] [<c035d233>] raid5d+0x100/0x161
[473795.214907] [<c036b03c>] md_thread+0x40/0x103
[473795.214912] [<c012dbbe>] autoremove_wake_function+0x0/0x4b
[473795.214917] [<c036affc>] md_thread+0x0/0x103
[473795.214922] [<c012da1a>] kthread+0xfc/0x100
[473795.214926] [<c012d91e>] kthread+0x0/0x100
[473795.214930] [<c0103b4b>] kernel_thread_helper+0x7/0x1c
[473795.214935] =======================
[473795.214938] Code: 14 39 d1 0f 8d 10 01 00 00 89 c8 01 c0 01 c8 01 c0
01 c0 89 44 24 1c eb 51 89 d9 c1 e9 02 8b 7c 24 10 01 f7 8b 44 24 18 8d 34
02 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 c7 44 24 04 03 00 00 00 89 14
[473795.215017] EIP: [<c0358b14>] copy_data+0x6c/0x179 SS:ESP 0068:f7927dc4
[473795.215024] <6>note: md4_raid5[1305] exited with preempt_count 2

# mdadm -D /dev/md4
/dev/md4:
        Version : 01.00.03
  Creation Time : Wed Jan 10 15:58:52 2007
     Raid Level : raid5
     Array Size : 1562834432 (1490.44 GiB 1600.34 GB)
    Device Size : 781417216 (372.61 GiB 400.09 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Sat Jan 20 07:15:01 2007
          State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           Name : 4
           UUID : 7f453e18:893e4dd9:6e810372:4c724f49
         Events : 33

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       81        1      active sync   /dev/sdf1
       2       8      113        2      active sync   /dev/sdh1
       3       8       65        3      active sync   /dev/sde1
       5       8       49        4      active sync   /dev/sdd1


Attachments:
config.bz2 (7.05 kB)

2007-01-20 12:46:36

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Sat, 20 Jan 2007, Justin Piszcz wrote:

> My .config is attached, please let me know if any other information is
> needed and please CC (lkml) as I am not on the list, thanks!
>
> Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> the RAID5 running XFS.
>
> Any idea what happened here?
>

It happened again under heavy read I/O when I was running md5sum -c on
some of my files.

[ 551.942958] BUG: unable to handle kernel paging request at virtual address fffb97b0
[ 551.942970] printing eip:
[ 551.942972] c0358bd8
[ 551.942974] *pde = 00003067
[ 551.942976] *pte = 00000000
[ 551.942980] Oops: 0002 [#1]
[ 551.942982] PREEMPT SMP
[ 551.942989] CPU: 0
[ 551.942990] EIP: 0060:[<c0358bd8>] Not tainted VLI
[ 551.942991] EFLAGS: 00010286 (2.6.19.2 #1)
[ 551.942999] EIP is at copy_data+0x130/0x179
[ 551.943001] eax: 00000000 ebx: 00001000 ecx: 00000214 edx: fffb9000
[ 551.943005] esi: dd2007b0 edi: fffb97b0 ebp: 00001000 esp: f76ffe1c
[ 551.943007] ds: 007b es: 007b ss: 0068
[ 551.943011] Process md4_raid5 (pid: 1309, ti=f76fe000 task=f7081560 task.ti=f76fe000)
[ 551.943013] Stack: c1d880c0 00000003 cd2f0540 00000000 dd200000 0000000e 00000000 000000a8
[ 551.943027] 00001000 cd2f0540 dd1f1adc f6435c48 dd1f1ad8 c035a977 34f3db20 c027be16
[ 551.943043] c0553328 00000002 00000002 c01146b9 f6435c48 c0553328 f6435c48 dd1f193c
[ 551.943056] Call Trace:
[ 551.943059] [<c035a977>] handle_stripe+0x1ca/0x2986
[ 551.943065] [<c027be16>] __next_cpu+0x22/0x33
[ 551.943072] [<c01146b9>] find_busiest_group+0x124/0x4fd
[ 551.943136] [<c01140af>] __wake_up+0x32/0x43
[ 551.943140] [<c03580e0>] release_stripe+0x21/0x2e
[ 551.943145] [<c035d233>] raid5d+0x100/0x161
[ 551.943150] [<c036b03c>] md_thread+0x40/0x103
[ 551.943155] [<c012dbbe>] autoremove_wake_function+0x0/0x4b
[ 551.943160] [<c036affc>] md_thread+0x0/0x103
[ 551.943165] [<c012da1a>] kthread+0xfc/0x100
[ 551.943169] [<c012d91e>] kthread+0x0/0x100
[ 551.943173] [<c0103b4b>] kernel_thread_helper+0x7/0x1c
[ 551.943178] =======================
[ 551.943180] Code: 8b 4c 24 08 8b 41 2c 8b 4c 24 1c 03 54 08 08 8b 44 24
0c 85 c0 0f 85 3a ff ff ff 89 d9 c1 e9 02 8b 44 24 18 8d 3c 02 03 74 24 10
<f3> a5 89 d9 83 e1 03 74 02 f3 a4 e9 37 ff ff ff 01 ee 89 74 24
[ 551.943254] EIP: [<c0358bd8>] copy_data+0x130/0x179 SS:ESP 0068:f76ffe1c
[ 551.943262] <6>note: md4_raid5[1309] exited with preempt_count 3

I will run resync/check on this array and then see if that fixes it.

Justin.

2007-01-22 21:02:14

by Chuck Ebbert

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

Justin Piszcz wrote:
> My .config is attached, please let me know if any other information is
> needed and please CC (lkml) as I am not on the list, thanks!
>
> Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> the RAID5 running XFS.
>
> Any idea what happened here?
>
> [473795.214705] BUG: unable to handle kernel paging request at virtual
> address fffb92b0
> [473795.214715] printing eip:
> [473795.214718] c0358b14
> [473795.214721] *pde = 00003067
> [473795.214723] *pte = 00000000
> [473795.214726] Oops: 0000 [#1]
> [473795.214729] PREEMPT SMP
> [473795.214736] CPU: 0
> [473795.214737] EIP: 0060:[<c0358b14>] Not tainted VLI
> [473795.214738] EFLAGS: 00010286 (2.6.19.2 #1)
> [473795.214746] EIP is at copy_data+0x6c/0x179
> [473795.214750] eax: 00000000 ebx: 00001000 ecx: 00000354 edx: fffb9000
> [473795.214754] esi: fffb92b0 edi: da86c2b0 ebp: 00001000 esp: f7927dc4
> [473795.214757] ds: 007b es: 007b ss: 0068
> [473795.214761] Process md4_raid5 (pid: 1305, ti=f7926000 task=f7ea9030 task.ti=f7926000)
> [473795.214765] Stack: c1ba7c40 00000003 f5538c80 00000001 da86c000 00000009 00000000 0000006c
> [473795.214790] 00001000 da8536a8 aa6fee90 f5538c80 00000190 c0358d00 aa6fee88 0000ffff
> [473795.214863] d7c5794c 00000001 da853488 f6fbec70 f6fbebc0 00000001 00000005 00000001
> [473795.214876] Call Trace:
> [473795.214880] [<c0358d00>] compute_parity5+0xdf/0x497
> [473795.214887] [<c035b0dd>] handle_stripe+0x930/0x2986
> [473795.214892] [<c01146b9>] find_busiest_group+0x124/0x4fd
> [473795.214898] [<c03580e0>] release_stripe+0x21/0x2e
> [473795.214902] [<c035d233>] raid5d+0x100/0x161
> [473795.214907] [<c036b03c>] md_thread+0x40/0x103
> [473795.214912] [<c012dbbe>] autoremove_wake_function+0x0/0x4b
> [473795.214917] [<c036affc>] md_thread+0x0/0x103
> [473795.214922] [<c012da1a>] kthread+0xfc/0x100
> [473795.214926] [<c012d91e>] kthread+0x0/0x100
> [473795.214930] [<c0103b4b>] kernel_thread_helper+0x7/0x1c
> [473795.214935] =======================
> [473795.214938] Code: 14 39 d1 0f 8d 10 01 00 00 89 c8 01 c0 01 c8 01 c0
> 01 c0 89 44 24 1c eb 51 89 d9 c1 e9 02 8b 7c 24 10 01 f7 8b 44 24 18 8d 34
> 02 <f3> a5 89 d9 83 e1 03 74 02 f3 a4 c7 44 24 04 03 00 00 00 89 14
> [473795.215017] EIP: [<c0358b14>] copy_data+0x6c/0x179 SS:ESP
> 0068:f7927dc4
>
Without digging too deeply, I'd say you've hit the same bug Sami Farin
and others have reported starting with 2.6.19: pages mapped with
kmap_atomic() become unmapped during memcpy() or similar operations.
Try disabling preempt -- that seems to be the common factor.


2007-01-22 22:00:05

by NeilBrown

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

On Monday January 22, [email protected] wrote:
> Justin Piszcz wrote:
> > My .config is attached, please let me know if any other information is
> > needed and please CC (lkml) as I am not on the list, thanks!
> >
> > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> > the RAID5 running XFS.
> >
> > Any idea what happened here?
....
> >
> Without digging too deeply, I'd say you've hit the same bug Sami Farin
> and others have reported starting with 2.6.19: pages mapped with
> kmap_atomic() become unmapped during memcpy() or similar operations.
> Try disabling preempt -- that seems to be the common factor.

That is exactly the conclusion I had just come to (a kmap_atomic page
must be being unmapped during memcpy). I wasn't aware that others had
reported it - thanks for that.

Turning off CONFIG_PREEMPT certainly seems like a good idea.

NeilBrown

2007-01-23 01:44:15

by Dan Williams

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

On 1/22/07, Neil Brown <[email protected]> wrote:
> On Monday January 22, [email protected] wrote:
> > Justin Piszcz wrote:
> > > My .config is attached, please let me know if any other information is
> > > needed and please CC (lkml) as I am not on the list, thanks!
> > >
> > > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> > > the RAID5 running XFS.
> > >
> > > Any idea what happened here?
> ....
> > >
> > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > and others have reported starting with 2.6.19: pages mapped with
> > kmap_atomic() become unmapped during memcpy() or similar operations.
> > Try disabling preempt -- that seems to be the common factor.
>
> That is exactly the conclusion I had just come to (a kmap_atomic page
> must be being unmapped during memcpy). I wasn't aware that others had
> reported it - thanks for that.
>
> Turning off CONFIG_PREEMPT certainly seems like a good idea.
>
Coming from an ARM background I am not yet versed in the inner
workings of kmap_atomic, but if you have time for a question I am
curious as to why spin_lock(&sh->lock) is not sufficient pre-emption
protection for copy_data() in this case?

> NeilBrown

Regards,
Dan

2007-01-23 02:07:16

by NeilBrown

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

On Monday January 22, [email protected] wrote:
> On 1/22/07, Neil Brown <[email protected]> wrote:
> > On Monday January 22, [email protected] wrote:
> > > Justin Piszcz wrote:
> > > > My .config is attached, please let me know if any other information is
> > > > needed and please CC (lkml) as I am not on the list, thanks!
> > > >
> > > > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> > > > the RAID5 running XFS.
> > > >
> > > > Any idea what happened here?
> > ....
> > > >
> > > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > > and others have reported starting with 2.6.19: pages mapped with
> > > kmap_atomic() become unmapped during memcpy() or similar operations.
> > > Try disabling preempt -- that seems to be the common factor.
> >
> > That is exactly the conclusion I had just come to (a kmap_atomic page
> > must be being unmapped during memcpy). I wasn't aware that others had
> > reported it - thanks for that.
> >
> > Turning off CONFIG_PREEMPT certainly seems like a good idea.
> >
> Coming from an ARM background I am not yet versed in the inner
> workings of kmap_atomic, but if you have time for a question I am
> curious as to why spin_lock(&sh->lock) is not sufficient pre-emption
> protection for copy_data() in this case?
>

Presumably there is a bug somewhere.
kmap_atomic itself calls inc_preempt_count so that preemption should
be disabled at least until the kunmap_atomic is called.

But apparently not. The symptoms point exactly to the page getting
unmapped when it shouldn't. Until that bug is found and fixed, the
workaround of turning off CONFIG_PREEMPT seems to make sense.

Of course it would be great if someone who can easily reproduce this
bug could do the 'git bisect' thing to find out where the bug crept
in.....

NeilBrown

2007-01-23 10:56:30

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Tue, 23 Jan 2007, Neil Brown wrote:

> On Monday January 22, [email protected] wrote:
> > Justin Piszcz wrote:
> > > My .config is attached, please let me know if any other information is
> > > needed and please CC (lkml) as I am not on the list, thanks!
> > >
> > > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> > > the RAID5 running XFS.
> > >
> > > Any idea what happened here?
> ....
> > >
> > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > and others have reported starting with 2.6.19: pages mapped with
> > kmap_atomic() become unmapped during memcpy() or similar operations.
> > Try disabling preempt -- that seems to be the common factor.
>
> That is exactly the conclusion I had just come to (a kmap_atomic page
> must be being unmapped during memcpy). I wasn't aware that others had
> reported it - thanks for that.
>
> Turning off CONFIG_PREEMPT certainly seems like a good idea.
>
> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Is this a bug that can or will be fixed or should I disable pre-emption on
critical and/or server machines?

Justin.

2007-01-23 11:08:36

by Michael Tokarev

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

Justin Piszcz wrote:
[]
> Is this a bug that can or will be fixed or should I disable pre-emption on
> critical and/or server machines?

Disabling pre-emption on critical and/or server machines seems to be a good
idea in the first place. IMHO anyway.. ;)

/mjt

2007-01-23 11:59:21

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Tue, 23 Jan 2007, Michael Tokarev wrote:

> Justin Piszcz wrote:
> []
> > Is this a bug that can or will be fixed or should I disable pre-emption on
> > critical and/or server machines?
>
> Disabling pre-emption on critical and/or server machines seems to be a good
> idea in the first place. IMHO anyway.. ;)
>
> /mjt
>

So for a server system, the following options should be as follows:

Preemption Model (No Forced Preemption (Server)) --->
[ ] Preempt The Big Kernel Lock

Also, my mobo has HPET timer support in the BIOS, is there any reason to
use this on a server? I do run X on it via the Intel 965 chipset video.

So bottom line is make sure not to use preemption on servers or else you
will get weird spinlock/deadlocks on RAID devices--GOOD To know!

Thanks!

Justin.

2007-01-23 12:48:11

by Michael Tokarev

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

Justin Piszcz wrote:
>
> On Tue, 23 Jan 2007, Michael Tokarev wrote:
>
>> Disabling pre-emption on critical and/or server machines seems to be a good
>> idea in the first place. IMHO anyway.. ;)
>
> So bottom line is make sure not to use preemption on servers or else you
> will get weird spinlock/deadlocks on RAID devices--GOOD To know!

This is not a reason. The reason is that preemption usually works worse
on servers, esp. high-loaded servers - the more often you interrupt a
(kernel) work, the more needless context switches you'll have, and the
slower the whole thing works.

Another point is that with preemption enabled, we have more chances to
hit one or another bug somewhere. Those bugs should be found and fixed
for sure, but important servers/data isn't a place usually for bughunting.

/mjt

2007-01-23 13:46:50

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Tue, 23 Jan 2007, Michael Tokarev wrote:

> Justin Piszcz wrote:
> >
> > On Tue, 23 Jan 2007, Michael Tokarev wrote:
> >
> >> Disabling pre-emption on critical and/or server machines seems to be a good
> >> idea in the first place. IMHO anyway.. ;)
> >
> > So bottom line is make sure not to use preemption on servers or else you
> > will get weird spinlock/deadlocks on RAID devices--GOOD To know!
>
> This is not a reason. The reason is that preemption usually works worse
> on servers, esp. high-loaded servers - the more often you interrupt a
> (kernel) work, the more needless context switches you'll have, and the
> slower the whole thing works.
>
> Another point is that with preemption enabled, we have more chances to
> hit one or another bug somewhere. Those bugs should be found and fixed
> for sure, but important servers/data isn't a place usually for bughunting.
>
> /mjt
>

Thanks for the update/info.

Justin.

2007-01-24 23:37:20

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Mon, 22 Jan 2007, Chuck Ebbert wrote:

> Justin Piszcz wrote:
> > My .config is attached, please let me know if any other information is
> > needed and please CC (lkml) as I am not on the list, thanks!
> >
> > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to
> > the RAID5 running XFS.
> >
> > Any idea what happened here?
> >
> > [473795.214705] BUG: unable to handle kernel paging request at virtual
> > address fffb92b0
> > [473795.214715] printing eip:
> > [473795.214718] c0358b14
> > [473795.214721] *pde = 00003067
> > [473795.214723] *pte = 00000000
> > [473795.214726] Oops: 0000 [#1]
> > [473795.214729] PREEMPT SMP
> > [473795.214736] CPU: 0
> > [473795.214737] EIP: 0060:[<c0358b14>] Not tainted VLI
> > [473795.214738] EFLAGS: 00010286 (2.6.19.2 #1)
> > [473795.214746] EIP is at copy_data+0x6c/0x179
> > [473795.214750] eax: 00000000 ebx: 00001000 ecx: 00000354 edx: fffb9000
> > [473795.214754] esi: fffb92b0 edi: da86c2b0 ebp: 00001000 esp: f7927dc4
> > [473795.214757] ds: 007b es: 007b ss: 0068
> > [473795.214761] Process md4_raid5 (pid: 1305, ti=f7926000 task=f7ea9030 task.ti=f7926000)
> > [473795.214765] Stack: c1ba7c40 00000003 f5538c80 00000001 da86c000 00000009 00000000 0000006c
> > [473795.214790] 00001000 da8536a8 aa6fee90 f5538c80 00000190 c0358d00 aa6fee88 0000ffff
> > [473795.214863] d7c5794c 00000001 da853488 f6fbec70 f6fbebc0 00000001 00000005 00000001
> > [473795.214876] Call Trace:
> > [473795.214880] [<c0358d00>] compute_parity5+0xdf/0x497
> > [473795.214887] [<c035b0dd>] handle_stripe+0x930/0x2986
> > [473795.214892] [<c01146b9>] find_busiest_group+0x124/0x4fd
> > [473795.214898] [<c03580e0>] release_stripe+0x21/0x2e
> > [473795.214902] [<c035d233>] raid5d+0x100/0x161
> > [473795.214907] [<c036b03c>] md_thread+0x40/0x103
> > [473795.214912] [<c012dbbe>] autoremove_wake_function+0x0/0x4b
> > [473795.214917] [<c036affc>] md_thread+0x0/0x103
> > [473795.214922] [<c012da1a>] kthread+0xfc/0x100
> > [473795.214926] [<c012d91e>] kthread+0x0/0x100
> > [473795.214930] [<c0103b4b>] kernel_thread_helper+0x7/0x1c
> > [473795.214935] =======================
> > [473795.214938] Code: 14 39 d1 0f 8d 10 01 00 00 89 c8 01 c0 01 c8 01 c0 01
> > c0 89 44 24 1c eb 51 89 d9 c1 e9 02 8b 7c 24 10 01 f7 8b 44 24 18 8d 34 02
> > <f3> a5 89 d9 83 e1 03 74 02 f3 a4 c7 44 24 04 03 00 00 00 89 14
> > [473795.215017] EIP: [<c0358b14>] copy_data+0x6c/0x179 SS:ESP 0068:f7927dc4
> >
> Without digging too deeply, I'd say you've hit the same bug Sami Farin
> and others have reported starting with 2.6.19: pages mapped with
> kmap_atomic() become unmapped during memcpy() or similar operations.
> Try disabling preempt -- that seems to be the common factor.
>
>
>
>

After I run some other tests, I am going to re-run this test and see if it
OOPSes again with PREEMPT off.

Justin.

2007-01-26 09:25:56

by Andrew Morton

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

On Wed, 24 Jan 2007 18:37:15 -0500 (EST)
Justin Piszcz <[email protected]> wrote:

> > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > and others have reported starting with 2.6.19: pages mapped with
> > kmap_atomic() become unmapped during memcpy() or similar operations.
> > Try disabling preempt -- that seems to be the common factor.
> >
> >
> >
> >
>
> After I run some other tests, I am going to re-run this test and see if it
> OOPSes again with PREEMPT off.

Strange. The below debug patch might catch it - please run with this
applied.


--- a/arch/i386/mm/highmem.c~kmap_atomic-debugging
+++ a/arch/i386/mm/highmem.c
@@ -30,7 +30,43 @@ void *kmap_atomic(struct page *page, enu
 {
 	enum fixed_addresses idx;
 	unsigned long vaddr;
+	static unsigned warn_count = 10;

+	if (unlikely(warn_count == 0))
+		goto skip;
+
+	if (unlikely(in_interrupt())) {
+		if (in_irq()) {
+			if (type != KM_IRQ0 && type != KM_IRQ1 &&
+			    type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
+			    type != KM_BOUNCE_READ) {
+				WARN_ON(1);
+				warn_count--;
+			}
+		} else if (!irqs_disabled()) {	/* softirq */
+			if (type != KM_IRQ0 && type != KM_IRQ1 &&
+			    type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
+			    type != KM_SKB_SUNRPC_DATA &&
+			    type != KM_SKB_DATA_SOFTIRQ &&
+			    type != KM_BOUNCE_READ) {
+				WARN_ON(1);
+				warn_count--;
+			}
+		}
+	}
+
+	if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ) {
+		if (!irqs_disabled()) {
+			WARN_ON(1);
+			warn_count--;
+		}
+	} else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
+		if (irq_count() == 0 && !irqs_disabled()) {
+			WARN_ON(1);
+			warn_count--;
+		}
+	}
+skip:
 	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
 	pagefault_disable();
 	if (!PageHighMem(page))
_

2007-01-26 09:37:59

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)



On Fri, 26 Jan 2007, Andrew Morton wrote:

> On Wed, 24 Jan 2007 18:37:15 -0500 (EST)
> Justin Piszcz <[email protected]> wrote:
>
> > > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > > and others have reported starting with 2.6.19: pages mapped with
> > > kmap_atomic() become unmapped during memcpy() or similar operations.
> > > Try disabling preempt -- that seems to be the common factor.
> > >
> > >
> > >
> > >
> >
> > After I run some other tests, I am going to re-run this test and see if it
> > OOPSes again with PREEMPT off.
>
> Strange. The below debug patch might catch it - please run with this
> applied.
>
>
> --- a/arch/i386/mm/highmem.c~kmap_atomic-debugging
> +++ a/arch/i386/mm/highmem.c
> @@ -30,7 +30,43 @@ void *kmap_atomic(struct page *page, enu
>  {
>  	enum fixed_addresses idx;
>  	unsigned long vaddr;
> +	static unsigned warn_count = 10;
>
> +	if (unlikely(warn_count == 0))
> +		goto skip;
> +
> +	if (unlikely(in_interrupt())) {
> +		if (in_irq()) {
> +			if (type != KM_IRQ0 && type != KM_IRQ1 &&
> +			    type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
> +			    type != KM_BOUNCE_READ) {
> +				WARN_ON(1);
> +				warn_count--;
> +			}
> +		} else if (!irqs_disabled()) {	/* softirq */
> +			if (type != KM_IRQ0 && type != KM_IRQ1 &&
> +			    type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
> +			    type != KM_SKB_SUNRPC_DATA &&
> +			    type != KM_SKB_DATA_SOFTIRQ &&
> +			    type != KM_BOUNCE_READ) {
> +				WARN_ON(1);
> +				warn_count--;
> +			}
> +		}
> +	}
> +
> +	if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ) {
> +		if (!irqs_disabled()) {
> +			WARN_ON(1);
> +			warn_count--;
> +		}
> +	} else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
> +		if (irq_count() == 0 && !irqs_disabled()) {
> +			WARN_ON(1);
> +			warn_count--;
> +		}
> +	}
> +skip:
>  	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
>  	pagefault_disable();
>  	if (!PageHighMem(page))
> _
>
>

The RAID5 bug may be hard to trigger, I have only made it happen once so
far (but only tried it once, don't like locking up the raid :)), I will
re-run the test after applying this patch.

Justin.

2007-01-26 12:31:53

by Justin Piszcz

Subject: Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

Just re-ran the test 4-5 times, could not reproduce this one, but I'll
keep running this kernel w/patch for a while and see if it happens again.

On Fri, 26 Jan 2007, Andrew Morton wrote:

> On Wed, 24 Jan 2007 18:37:15 -0500 (EST)
> Justin Piszcz <[email protected]> wrote:
>
> > > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > > and others have reported starting with 2.6.19: pages mapped with
> > > kmap_atomic() become unmapped during memcpy() or similar operations.
> > > Try disabling preempt -- that seems to be the common factor.
> > >
> > >
> > >
> > >
> >
> > After I run some other tests, I am going to re-run this test and see if it
> > OOPSes again with PREEMPT off.
>
> Strange. The below debug patch might catch it - please run with this
> applied.
>
>
> --- a/arch/i386/mm/highmem.c~kmap_atomic-debugging
> +++ a/arch/i386/mm/highmem.c
> @@ -30,7 +30,43 @@ void *kmap_atomic(struct page *page, enu
>  {
>  	enum fixed_addresses idx;
>  	unsigned long vaddr;
> +	static unsigned warn_count = 10;
>
> +	if (unlikely(warn_count == 0))
> +		goto skip;
> +
> +	if (unlikely(in_interrupt())) {
> +		if (in_irq()) {
> +			if (type != KM_IRQ0 && type != KM_IRQ1 &&
> +			    type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
> +			    type != KM_BOUNCE_READ) {
> +				WARN_ON(1);
> +				warn_count--;
> +			}
> +		} else if (!irqs_disabled()) {	/* softirq */
> +			if (type != KM_IRQ0 && type != KM_IRQ1 &&
> +			    type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
> +			    type != KM_SKB_SUNRPC_DATA &&
> +			    type != KM_SKB_DATA_SOFTIRQ &&
> +			    type != KM_BOUNCE_READ) {
> +				WARN_ON(1);
> +				warn_count--;
> +			}
> +		}
> +	}
> +
> +	if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ) {
> +		if (!irqs_disabled()) {
> +			WARN_ON(1);
> +			warn_count--;
> +		}
> +	} else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
> +		if (irq_count() == 0 && !irqs_disabled()) {
> +			WARN_ON(1);
> +			warn_count--;
> +		}
> +	}
> +skip:
>  	/* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */
>  	pagefault_disable();
>  	if (!PageHighMem(page))
> _
>