2012-02-23 02:14:42

by TimLee

Subject: Hit OOPS on FPU save and restore while using AESNI for IPsec on 32-bit system

Hi All,

Recently I hit an OOPS on FPU save/restore in Linux 2.6.38.8 while using aesni-intel_asm.S and aesni-intel_glue.c for native IPsec (netkey) on a 32-bit system. The same OOPS occurs in 2.6.39.4, 3.0.x and 3.1.x, but I did not hit it on a 64-bit system with any of these versions.

My platform information:
"Linux dnsubuntu 2.6.38.8 #7 SMP Sat Nov 12 03:11:12 CST 2011 i686 i686 i386 GNU/Linux"

IPsec uses these two crypto drivers through the aead interface:
driver : cryptd(__driver-cbc-aes-aesni) --- my understanding: in the irq path, encryption/decryption is handed to the crypto daemon (cryptd) and done asynchronously
driver : authenc(hmac(sha1-generic),cbc-aes-aesni) --- my understanding: IPsec calls it in softirq context via the aead interface
All the function calls such as cbc_encrypt/cbc_decrypt in aesni-intel_glue.c are wrapped in kernel_fpu_begin()/kernel_fpu_end(). I have done some research on how FPU save/restore works in Linux, but I still cannot figure out where the problem is in this case. How can fxsave/fxrstor OOPS? How can tsk->thread.fpu.state be NULL when PF_USED_MATH or TS_USEDFPU is set?
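
To make the dispatch described above concrete, here is a minimal userspace sketch of the choice between running AES-NI synchronously and deferring the request to cryptd. It is a model only, not kernel code: `struct cpu_ctx`, `model_irq_fpu_usable()` and `model_pick_path()` are hypothetical names, and the predicate is an assumption about how the kernel's irq_fpu_usable() check of that era behaves.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these types exist in the kernel. */
struct cpu_ctx {
    bool in_interrupt;     /* running in hardirq or softirq context */
    bool interrupted_user; /* the interrupted code was running in user mode */
    bool cr0_ts_set;       /* CR0.TS set: no live FPU state loaded on this CPU */
};

enum aes_path {
    AES_PATH_SYNC,   /* run AES-NI now, between kernel_fpu_begin()/kernel_fpu_end() */
    AES_PATH_CRYPTD  /* queue the request to cryptd, which runs in process context */
};

/*
 * Assumed rule: the FPU may be used directly unless we interrupted kernel
 * code that may itself have live FPU/SSE state loaded (TS clear).
 */
bool model_irq_fpu_usable(const struct cpu_ctx *c)
{
    return !c->in_interrupt || c->interrupted_user || c->cr0_ts_set;
}

/* Pick the synchronous path when the FPU is usable, else defer to cryptd. */
enum aes_path model_pick_path(const struct cpu_ctx *c)
{
    return model_irq_fpu_usable(c) ? AES_PATH_SYNC : AES_PATH_CRYPTD;
}
```

Under this model, a softirq that interrupts kernel code while FPU state is loaded is the only case that goes to cryptd; every other case uses the FPU directly and relies on kernel_fpu_begin() saving the current state correctly.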

The problem is easy to reproduce with the following steps:
1. Build two 32-bit systems with AESNI enabled in crypto, install openswan, and use the netkey kernel IPsec stack. Create an ESP tunnel between the left and right IPsec gateways.
2. Run iperf from a host in the left subnet to a host in the right subnet; the iperf traffic can be bidirectional.
3. Run top or tcpdump on the left and right IPsec gateways.
4. From another client or desktop, log in to both VPN gateways over SSH many times.
5. The SSH connections become unstable, and so do top and tcpdump. Within 5 to 10 minutes there is an OOPS, then the system hangs.

I have some questions:
1. Can the functions in aesni-intel_glue.c safely be called in softirq context (for example from the IPsec stack)?
2. I think these functions should not be called in interrupt context. Is that correct?
3. Have these functions been used/tested for the native Linux IPsec stack via the aead interface on a 32-bit platform? This could be a bug in 32-bit AESNI usage by the native Linux IPsec stack.

I have attached the OOPS image, the backtrace and the decoded code.
Please advise on this OOPS: what do you think is causing it, and how can it be fixed?


OOPS info
<snip>
IP: [<c1009880>] __switch_to+0x150/0x190
*pdpt = 0000000030580001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
last sysfs file: /sys/module/serpent/initstate
<snip>
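
The `Oops: 0002` value above is the x86 page-fault error code. A minimal sketch of how its low bits break down follows; the struct and function names are hypothetical, and only the architectural bit meanings are taken as given.

```c
#include <stdbool.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct pf_error {
    bool protection;   /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;        /* bit 1: 1 = write access, 0 = read access */
    bool user;         /* bit 2: 1 = fault raised in user mode, 0 = kernel mode */
    bool reserved_bit; /* bit 3: 1 = reserved bit set in a paging entry */
    bool instr_fetch;  /* bit 4: 1 = fault on an instruction fetch */
};

/* Split an x86 page-fault error code into its individual flag bits. */
struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection   = (code & 0x01) != 0;
    e.write        = (code & 0x02) != 0;
    e.user         = (code & 0x04) != 0;
    e.reserved_bit = (code & 0x08) != 0;
    e.instr_fetch  = (code & 0x10) != 0;
    return e;
}
```

For the code 0x0002 this gives a kernel-mode write to a page that is not present, which matches a faulting fxsave (a store to memory) together with the `*pde = 0000000000000000` line.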

<snip>
Code: 00 80 7d e7 00 74 05 e8 ff 23 00 00 64 89 35 2c 82 85 c1 89 d8 83 c4 14 5b 5e 5f 5d c3 8d b6 00 00 00 00 89 f6 8b 83 4c 03 00 00 <0f> ae 00 8b 83 4c 03 00 00 e9 15 ff ff ff 66 90 8b 83 4c 03 00

[email protected]:/linux-source-2.6.38# find -name decodecode
./scripts/decodecode
[email protected]:/linux-source-2.6.38# echo "Code: 00 80 7d e7 00 74 05 e8 ff 23 00 00 64 89 35 2c 82 85 c1 89 d8 83 c4 14 5b 5e 5f 5d c3 8d b6 00 00 00 00 89 f6 8b 83 4c 03 00 00 <0f> ae 00 8b 83 4c 03 00 00 e9 15 ff ff ff 66 90 8b 83 4c 03 00" | ./scripts/decodecode
Code: 00 80 7d e7 00 74 05 e8 ff 23 00 00 64 89 35 2c 82 85 c1 89 d8 83 c4 14 5b 5e 5f 5d c3 8d b6 00 00 00 00 89 f6 8b 83 4c 03 00 00 <0f> ae 00 8b 83 4c 03 00 00 e9 15 ff ff ff 66 90 8b 83 4c 03 00
All code
========
   0:  00 80 7d e7 00 74      add    %al,0x7400e77d(%eax)
   6:  05 e8 ff 23 00         add    $0x23ffe8,%eax
   b:  00 64 89 35            add    %ah,0x35(%ecx,%ecx,4)
   f:  2c 82                  sub    $0x82,%al
  11:  85 c1                  test   %eax,%ecx
  13:  89 d8                  mov    %ebx,%eax
  15:  83 c4 14               add    $0x14,%esp
  18:  5b                     pop    %ebx
  19:  5e                     pop    %esi
  1a:  5f                     pop    %edi
  1b:  5d                     pop    %ebp
  1c:  c3                     ret
  1d:  8d b6 00 00 00 00      lea    0x0(%esi),%esi
  23:  89 f6                  mov    %esi,%esi
  25:  8b 83 4c 03 00 00      mov    0x34c(%ebx),%eax
  2b:* 0f ae 00               fxsave (%eax)     <-- trapping instruction
  2e:  8b 83 4c 03 00 00      mov    0x34c(%ebx),%eax
  34:  e9 15 ff ff ff         jmp    0xffffff4e
  39:  66 90                  xchg   %ax,%ax
  3b:  8b                     .byte 0x8b
  3c:  83                     .byte 0x83
  3d:  4c                     dec    %esp
  3e:  03 00                  add    (%eax),%eax

Code starting with the faulting instruction
===========================================
   0:  0f ae 00               fxsave (%eax)
   3:  8b 83 4c 03 00 00      mov    0x34c(%ebx),%eax
   9:  e9 15 ff ff ff         jmp    0xffffff23
   e:  66 90                  xchg   %ax,%ax
  10:  8b                     .byte 0x8b
  11:  83                     .byte 0x83
  12:  4c                     dec    %esp
  13:  03 00                  add    (%eax),%eax
[email protected]:/linux-source-2.6.38#
^C^CInterrupted while waiting for the program.
Give up (and stop debugging it)? (y or n) y
(gdb) target remote /dev/ttyS1
Remote debugging using /dev/ttyS1
fpu_fxsave (prev_p=0xf17c71a0, next_p=0xf5891940)
at /linux-source-2.6.38/arch/x86/include/asm/i387.h:209
209 asm volatile("fxsave %[fx]"
(gdb) bt
#0 fpu_fxsave (prev_p=0xf17c71a0, next_p=0xf5891940)
at /linux-source-2.6.38/arch/x86/include/asm/i387.h:209
#1 fpu_save_init (prev_p=0xf17c71a0, next_p=0xf5891940)
at /linux-source-2.6.38/arch/x86/include/asm/i387.h:238
#2 __save_init_fpu (prev_p=0xf17c71a0, next_p=0xf5891940)
at /linux-source-2.6.38/arch/x86/include/asm/i387.h:261
#3 __unlazy_fpu (prev_p=0xf17c71a0, next_p=0xf5891940)
at /linux-source-2.6.38/arch/x86/include/asm/i387.h:292
#4 __switch_to (prev_p=0xf17c71a0, next_p=0xf5891940)
at arch/x86/kernel/process_32.c:316
#5 0xc151fb3b in context_switch () at kernel/sched.c:2946
#6 schedule () at kernel/sched.c:3999
#7 0xc105073b in __cond_resched () at kernel/sched.c:5258
#8 0xc1520318 in _cond_resched () at kernel/sched.c:5265
#9 0xc1120419 in slab_pre_alloc_hook (s=<value optimized out>, gfpflags=208)
at mm/slub.c:795
#10 slab_alloc (s=<value optimized out>, gfpflags=208) at mm/slub.c:1744
#11 kmem_cache_alloc (s=<value optimized out>, gfpflags=208) at mm/slub.c:1770
#12 0xc113ef91 in d_alloc (parent=0x0, name=0xf09d3f24) at fs/dcache.c:1286
#13 0xc113f1ab in d_alloc_pseudo (sb=0xf58b5800, name=<value optimized out>)
at fs/dcache.c:1343
#14 0xc1435269 in sock_alloc_file (sock=0xf5667c40, f=0xf09d3f4c, flags=526336)
at net/socket.c:365
---Type <return> to continue, or q <return> to quit---
#15 0xc1435326 in sock_map_fd (sock=<value optimized out>,
flags=<value optimized out>) at net/socket.c:397
#16 0xc14364ac in sys_socket (family=1, type=1, protocol=0)
at net/socket.c:1313
#17 0xc1437768 in sys_socketcall (call=1, args=0xbfdb6398) at net/socket.c:2256
#18 <signal handler called>
#19 0xb7786424 in ?? ()
#20 0xb7721e11 in ?? ()
#21 0xb77222b9 in ?? ()
#22 0xb771f424 in ?? ()
#23 0xb771f7e2 in ?? ()
#24 0xb76b50c9 in ?? ()
#25 0xb76b4a0f in ?? ()
#26 0x08048627 in ?? ()
#27 0xb7633e37 in ?? ()
#28 0x08048501 in ?? ()
(gdb)
</snip>
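
The trapping bytes `0f ae 00` in the decode above can be checked by hand: `0f ae` is a grouped opcode and the third byte is a ModRM byte whose fields select the instruction and the memory operand. The sketch below splits that byte; the type and function names are hypothetical, and only the two group members relevant here (/0 and /1) are named.

```c
/* Hypothetical helper types for illustration only. */
struct modrm {
    unsigned mod; /* bits 7..6: addressing mode */
    unsigned reg; /* bits 5..3: opcode extension for the 0F AE group */
    unsigned rm;  /* bits 2..0: base register (000 = eax when mod is 00) */
};

enum fpu_grp_insn { INSN_FXSAVE, INSN_FXRSTOR, INSN_OTHER };

/* Split a ModRM byte into its three fields. */
struct modrm split_modrm(unsigned char byte)
{
    struct modrm m;

    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm  = byte & 0x7;
    return m;
}

/* Map the reg field of a 0F AE opcode to the instruction it selects. */
enum fpu_grp_insn classify_0fae(unsigned reg)
{
    switch (reg) {
    case 0:
        return INSN_FXSAVE;  /* 0F AE /0 */
    case 1:
        return INSN_FXRSTOR; /* 0F AE /1 */
    default:
        return INSN_OTHER;   /* other group members, not decoded here */
    }
}
```

For the byte 0x00 all three fields are zero, so the instruction is fxsave with a plain (%eax) operand and no displacement, in agreement with the decodecode output.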

Thanks & Regards
TimLee


2012-02-25 08:37:07

by Herbert Xu

Subject: Re: Hit OOPS on FPU save and restore while using AESNI for IPsec on 32-bit system

[email protected] wrote:
> Hi All,
>
> Recently I hit an OOPS on FPU save/restore in Linux version 2.6.38.8 using aesni_intel_asm.S and aesni_intel_glue.c for native IPSec(netkey) on 32bit System. The same OOPS were found in versions 2.6.39.4, 3.0.x and 3.1.x.But I did not hit this problem in 64 bit system for all these versions.
>
> My platform information:
> "Linux dnsubuntu 2.6.38.8 #7 SMP Sat Nov 12 03:11:12 CST 2011 i686 i686 i386 GNU/Linux"

Please try the latest git tree (rc4). It should be fixed.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt