2010-01-23 16:07:20

by Bruno Prémont

[permalink] [raw]
Subject: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

Today I tried out 2.6.33-rc5 but always got crashing system:

[ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]
[ 26.400534] *pde = 00000000
[ 26.400534] Oops: 0002 [#1]
[ 26.400534] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
[ 26.400534] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc 8021q snd_pcm_oss snd_mixer_oss xfs exportfs loop snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm ehci_hcd snd_timer uhci_hcd usbcore pcspkr i2c_i801 nsc_ircc snd irda snd_page_alloc crc_ccitt
[ 26.400534]
[ 26.400534] Pid: 0, comm: swapper Not tainted 2.6.33-rc5-00001-g0d9d71d #5 TravelMate 660/TravelMate 660
[ 26.400534] EIP: 0060:[<df856b89>] EFLAGS: 00010246 CPU: 0
[ 26.400534] EIP is at vlan_skb_recv+0x89/0x280 [8021q]
[ 26.400534] EAX: da74d000 EBX: 00000000 ECX: da046140 EDX: 00000004
[ 26.400534] ESI: 00000000 EDI: da023000 EBP: c13f7f10 ESP: c13f7ef4
[ 26.400534] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 26.400534] Process swapper (pid: 0, ti=c13f7000 task=c13fbb40 task.ti=c13f4000)
[ 26.400534] Stack:
[ 26.400534] c13f7f10 c10420c9 4b5b0d11 da099430 ddbb8800 da023000 df8584c0 c13f7f44
[ 26.400534] <0> c1249bdc ddbb8800 c142b000 da023098 ddbb8800 c142b000 df8584c0 00000000
[ 26.400534] <0> 00000081 da023000 ddbb8b60 da023000 c13f7fa4 c1208213 c123e9ea da059b40
[ 26.400534] Call Trace:
[ 26.400534] [<c10420c9>] ? ktime_get_real+0x19/0x40
[ 26.400534] [<c1249bdc>] ? netif_receive_skb+0x2cc/0x380
[ 26.400534] [<c1208213>] ? b44_poll+0x4e3/0x580
[ 26.400534] [<c123e9ea>] ? __kfree_skb+0x3a/0x90
[ 26.400534] [<c124a38a>] ? net_rx_action+0x9a/0x130
[ 26.400534] [<c102bb66>] ? __do_softirq+0x76/0x100
[ 26.400534] [<c102baf0>] ? __do_softirq+0x0/0x100
[ 26.400534] <IRQ>
[ 26.400534] [<c102b965>] ? irq_exit+0x55/0x70
[ 26.400534] [<c1004582>] ? do_IRQ+0x42/0xb0
[ 26.400534] [<c103f13d>] ? sched_clock_local+0xdd/0x180
[ 26.400534] [<c1003169>] ? common_interrupt+0x29/0x30
[ 26.400534] [<c103007b>] ? run_timer_softirq+0x18b/0x1c0
[ 26.400534] [<c1165673>] ? acpi_idle_enter_simple+0xe1/0x10c
[ 26.400534] [<c116544b>] ? acpi_idle_enter_bm+0xb9/0x200
[ 26.400534] [<c122cefb>] ? cpuidle_idle_call+0x8b/0xe0
[ 26.400534] [<c1001d7c>] ? cpu_idle+0x3c/0x60
[ 26.400534] [<c12d514d>] ? rest_init+0x4d/0x60
[ 26.400534] [<c142d6cf>] ? start_kernel+0x21f/0x259
[ 26.400534] [<c142d28b>] ? unknown_bootoption+0x0/0x1d3
[ 26.400534] [<c142d067>] ? i386_start_kernel+0x67/0x69
[ 26.400534] Code: 89 f0 0f b7 da 89 da 81 e2 ff 0f 00 00 e8 90 e4 ff ff 85 c0 89 47 14 0f 84 15 01 00 00 8b b6 e4 03 00 00 ba 04 00 00 00 c1 eb 0d <ff> 06 8b 47 4c 01 46 04 89 f8 e8 c8 77 9e e1 8b 57 14 89 55 e8
[ 26.400534] EIP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q] SS:ESP 0068:c13f7ef4
[ 26.400534] CR2: 0000000000000000
[ 26.710884] ---[ end trace 0c4b7a45bd92e2ba ]---
[ 26.718562] Kernel panic - not syncing: Fatal exception in interrupt
[ 26.726395] Pid: 0, comm: swapper Tainted: G D 2.6.33-rc5-00001-g0d9d71d #5
[ 26.734339] Call Trace:
[ 26.742290] [<c12db5d0>] ? printk+0x18/0x20
[ 26.750384] [<c12db508>] panic+0x44/0xf4
[ 26.758522] [<c1005bce>] oops_end+0x7e/0x90
[ 26.766754] [<c101cc0f>] no_context+0xbf/0x150
[ 26.774945] [<c101ccef>] __bad_area_nosemaphore+0x4f/0x180
[ 26.783227] [<c1248775>] ? dev_hard_start_xmit+0x205/0x2b0
[ 26.791583] [<c103999b>] ? autoremove_wake_function+0x1b/0x50
[ 26.800005] [<c12588bc>] ? sch_direct_xmit+0xec/0x140
[ 26.808462] [<c12489e9>] ? dev_queue_xmit+0x1c9/0x4b0
[ 26.816921] [<df85653c>] ? vlan_dev_hard_start_xmit+0x13c/0x1c0 [8021q]
[ 26.825443] [<c101ce32>] bad_area_nosemaphore+0x12/0x20
[ 26.833978] [<c101d21c>] do_page_fault+0x25c/0x2f0
[ 26.842561] [<c1257df5>] ? eth_header+0x25/0xc0
[ 26.851196] [<c101cfc0>] ? do_page_fault+0x0/0x2f0
[ 26.859859] [<c12dd406>] error_code+0x5e/0x64
[ 26.868578] [<c104007b>] ? sys_getgroups+0x1b/0xc0
[ 26.877206] [<c101cfc0>] ? do_page_fault+0x0/0x2f0
[ 26.885747] [<df856b89>] ? vlan_skb_recv+0x89/0x280 [8021q]
[ 26.894283] [<c10420c9>] ? ktime_get_real+0x19/0x40
[ 26.902859] [<c1249bdc>] netif_receive_skb+0x2cc/0x380
[ 26.911318] [<c1208213>] b44_poll+0x4e3/0x580
[ 26.919466] [<c123e9ea>] ? __kfree_skb+0x3a/0x90
[ 26.927297] [<c124a38a>] net_rx_action+0x9a/0x130
[ 26.934979] [<c102bb66>] __do_softirq+0x76/0x100
[ 26.942530] [<c102baf0>] ? __do_softirq+0x0/0x100
[ 26.949952] <IRQ> [<c102b965>] ? irq_exit+0x55/0x70
[ 26.957401] [<c1004582>] ? do_IRQ+0x42/0xb0
[ 26.964803] [<c103f13d>] ? sched_clock_local+0xdd/0x180
[ 26.972247] [<c1003169>] ? common_interrupt+0x29/0x30
[ 26.979667] [<c103007b>] ? run_timer_softirq+0x18b/0x1c0
[ 26.987033] [<c1165673>] ? acpi_idle_enter_simple+0xe1/0x10c
[ 26.994363] [<c116544b>] ? acpi_idle_enter_bm+0xb9/0x200
[ 27.001573] [<c122cefb>] ? cpuidle_idle_call+0x8b/0xe0
[ 27.008683] [<c1001d7c>] ? cpu_idle+0x3c/0x60
[ 27.015733] [<c12d514d>] ? rest_init+0x4d/0x60
[ 27.022683] [<c142d6cf>] ? start_kernel+0x21f/0x259
[ 27.029627] [<c142d28b>] ? unknown_bootoption+0x0/0x1d3
[ 27.036592] [<c142d067>] ? i386_start_kernel+0x67/0x69
[ 27.043549] [drm:drm_fb_helper_panic] *ERROR* panic occurred, switching back to text console

repeating the following block many times until until kernel finally
ends up at panic.

[ 27.050739] BUG: scheduling while atomic: swapper/0/0x10000100
[ 27.057977] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc 8021q snd_pcm_oss snd_mixer_oss xfs exportfs loop snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm ehci_hcd snd_timer uhci_hcd usbcore pcspkr i2c_i801 nsc_ircc snd irda snd_page_alloc crc_ccitt
[ 27.074086] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc 8021q snd_pcm_oss snd_mixer_oss xfs exportfs loop snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm ehci_hcd snd_timer uhci_hcd usbcore pcspkr i2c_i801 nsc_ircc snd irda snd_page_alloc crc_ccitt
[ 27.091490]
[ 27.100093] Pid: 0, comm: swapper Tainted: G D 2.6.33-rc5-00001-g0d9d71d #5 TravelMate 660/TravelMate 660
[ 27.109086] EIP: 0060:[<c1165673>] EFLAGS: 00000286 CPU: 0
[ 27.118106] EIP is at acpi_idle_enter_simple+0xe1/0x10c
[ 27.127138] EAX: c13f4f68 EBX: 00000333 ECX: 00000000 EDX: 00000006
[ 27.136253] ESI: 00000000 EDI: dd851c48 EBP: c13f4f88 ESP: c13f4f68
[ 27.145445] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 27.154661] Process swapper (pid: 0, ti=c13f7000 task=c13fbb40 task.ti=c13f4000)
[ 27.164033] Stack:
[ 27.173342] 000f4240 00000000 dd851800 000000e5 00000000 dd851934 dd85181c dd851cbc
[ 27.173636] <0> c13f4fac c116544b dd851800 00000000 00000055 00000001 dd851934 dd85181c
[ 27.183296] <0> c13f5000 c13f4fbc c122cefb c1455940 0009fe00 c13f4fc4 c1001d7c c13f4fcc
[ 27.202566] Call Trace:
[ 27.212012] [<c116544b>] ? acpi_idle_enter_bm+0xb9/0x200
[ 27.221307] [<c122cefb>] ? cpuidle_idle_call+0x8b/0xe0
[ 27.230476] [<c1001d7c>] ? cpu_idle+0x3c/0x60
[ 27.239548] [<c12d514d>] ? rest_init+0x4d/0x60
[ 27.248602] [<c142d6cf>] ? start_kernel+0x21f/0x259
[ 27.257627] [<c142d28b>] ? unknown_bootoption+0x0/0x1d3
[ 27.266563] [<c142d067>] ? i386_start_kernel+0x67/0x69
[ 27.275418] Code: 69 4d f0 99 9e 36 00 8d 14 11 e8 69 1b fb ff 89 c3 b8 17 01 00 00 69 ca 17 01 00 00 89 d6 f7 e3 8d 14 11 e8 20 9d ed ff fb 89 e0 <31> c9 25 00 f0 ff ff 89 fa 83 48 0c 04 ff 47 18 8b 45 e8 e8 f3
[ 27.294819] Call Trace:
[ 27.304169] [<c116544b>] acpi_idle_enter_bm+0xb9/0x200
[ 27.313634] [<c122cefb>] cpuidle_idle_call+0x8b/0xe0
[ 27.323086] [<c1001d7c>] cpu_idle+0x3c/0x60
[ 27.332507] [<c12d514d>] rest_init+0x4d/0x60
[ 27.341898] [<c142d6cf>] start_kernel+0x21f/0x259
[ 27.351278] [<c142d28b>] ? unknown_bootoption+0x0/0x1d3
[ 27.360691] [<c142d067>] i386_start_kernel+0x67/0x69


Kernel boots fine up to init, but sometime before getting login prompt it
floods console with traces until finally panicing. This always happens
after network startup (My B44 interface is renamed to lan and I have one
vlan active on top of it, both vlan and parent device being used)

2.6.32 (with a few DRM patches from 2.6.33-rc0 applied) works fine.


Objdump of 8021q.ko, vlan_skb_recv() around the crashing code:
00001b00 <vlan_skb_recv>:
1b00: 55 push %ebp
1b01: 89 e5 mov %esp,%ebp
1b03: 57 push %edi
1b04: 56 push %esi
1b05: 89 d6 mov %edx,%esi
1b07: 53 push %ebx
1b08: 89 c3 mov %eax,%ebx
1b0a: 83 ec 10 sub $0x10,%esp
1b0d: 89 df mov %ebx,%edi
1b0f: 8b 80 98 00 00 00 mov 0x98(%eax),%eax
1b15: 48 dec %eax
1b16: 74 15 je 1b2d <vlan_skb_recv+0x2d>
1b18: ba 20 00 00 00 mov $0x20,%edx
1b1d: 89 d8 mov %ebx,%eax
1b1f: e8 fc ff ff ff call 1b20 <vlan_skb_recv+0x20>
1b24: 89 c7 mov %eax,%edi
1b26: 89 d8 mov %ebx,%eax
1b28: e8 fc ff ff ff call 1b29 <vlan_skb_recv+0x29>
1b2d: 85 ff test %edi,%edi
1b2f: 0f 84 5b 01 00 00 je 1c90 <vlan_skb_recv+0x190>
1b35: 8b 47 4c mov 0x4c(%edi),%eax
1b38: 8b 57 50 mov 0x50(%edi),%edx
1b3b: 89 c1 mov %eax,%ecx
1b3d: 29 d1 sub %edx,%ecx
1b3f: 83 f9 03 cmp $0x3,%ecx
1b42: 0f 86 60 01 00 00 jbe 1ca8 <vlan_skb_recv+0x1a8>
1b48: 8b 87 90 00 00 00 mov 0x90(%edi),%eax
1b4e: 89 45 f0 mov %eax,-0x10(%ebp)
1b51: 0f b7 00 movzwl (%eax),%eax
1b54: 89 c2 mov %eax,%edx
1b56: c1 e2 08 shl $0x8,%edx
1b59: c1 e8 08 shr $0x8,%eax
1b5c: 09 c2 or %eax,%edx
1b5e: 89 f0 mov %esi,%eax
1b60: 0f b7 da movzwl %dx,%ebx
1b63: 89 da mov %ebx,%edx
1b65: 81 e2 ff 0f 00 00 and $0xfff,%edx
1b6b: e8 fc ff ff ff call 1b6c <vlan_skb_recv+0x6c>
1b70: 85 c0 test %eax,%eax
1b72: 89 47 14 mov %eax,0x14(%edi)
1b75: 0f 84 15 01 00 00 je 1c90 <vlan_skb_recv+0x190>
1b7b: 8b b6 e4 03 00 00 mov 0x3e4(%esi),%esi
1b81: ba 04 00 00 00 mov $0x4,%edx
1b86: c1 eb 0d shr $0xd,%ebx
1b89: ff 06 incl (%esi)
^___ this matches function offset and ESI is NULL according to
trace
1b8b: 8b 47 4c mov 0x4c(%edi),%eax
1b8e: 01 46 04 add %eax,0x4(%esi)
1b91: 89 f8 mov %edi,%eax
1b93: e8 fc ff ff ff call 1b94 <vlan_skb_recv+0x94>
1b98: 8b 57 14 mov 0x14(%edi),%edx
1b9b: 89 55 e8 mov %edx,-0x18(%ebp)
1b9e: 8b 84 9a 64 03 00 00 mov 0x364(%edx,%ebx,4),%eax
1ba5: 0f b6 4f 61 movzbl 0x61(%edi),%ecx
1ba9: 89 47 5c mov %eax,0x5c(%edi)
1bac: 88 c8 mov %cl,%al
1bae: 24 07 and $0x7,%al

int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *ptype, struct net_device *orig_dev)
{
struct vlan_hdr *vhdr;
struct vlan_rx_stats *rx_stats;
u16 vlan_id;
u16 vlan_tci;

skb = skb_share_check(skb, GFP_ATOMIC);
if (skb == NULL)
goto err_free;

if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
goto err_free;

vhdr = (struct vlan_hdr *)skb->data;
vlan_tci = ntohs(vhdr->h_vlan_TCI);
vlan_id = vlan_tci & VLAN_VID_MASK;

rcu_read_lock();
skb->dev = __find_vlan_dev(dev, vlan_id);
if (!skb->dev) {
pr_debug("%s: ERROR: No net_device for VID: %u on dev: %s\n",
__func__, vlan_id, dev->name);
goto err_unlock;
}

rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
smp_processor_id());
rx_stats->rx_packets++;
^___ looks like it is crashing here... (line 168 in net/8021q/vlan_dev.c)

rx_stats->rx_bytes += skb->len;


Above part of code did change between 2.6.32 and 2.6.33-rc5 with commit
9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX stats accounting)
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267

Note, my system is compiled with SMP=n as I just have a single-core CPU
in this laptop. Kernel config file attached. If complete kernel log is
needed, just ask.

Bruno


Attachments:
(No filename) (13.23 kB)
.config (64.21 kB)
Download all attachments

2010-01-23 21:32:05

by Bruno Prémont

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
> stats accounting)
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267

Reverting just that commit gets the system running correctly.

Bruno

2010-01-24 07:10:59

by Cong Wang

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sat, Jan 23, 2010 at 04:56:57PM +0100, Bruno Prémont wrote:
>Today I tried out 2.6.33-rc5 but always got crashing system:
>
>[ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
>[ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]


What does 'addr2line -e vmlinux 0xdf856b89' say?

Thanks.


--
Live like a child, think like the god.

2010-01-24 07:17:30

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sun, Jan 24, 2010 at 03:12:52PM +0800, Am?rico Wang wrote:
> On Sat, Jan 23, 2010 at 04:56:57PM +0100, Bruno Pr?mont wrote:
> >Today I tried out 2.6.33-rc5 but always got crashing system:
> >
> >[ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
> >[ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]
>
>
> What does 'addr2line -e vmlinux 0xdf856b89' say?

EIP is in module, so it will say nothing.

2010-01-24 07:28:21

by Cong Wang

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sun, Jan 24, 2010 at 09:17:28AM +0200, Alexey Dobriyan wrote:
>On Sun, Jan 24, 2010 at 03:12:52PM +0800, Américo Wang wrote:
>> On Sat, Jan 23, 2010 at 04:56:57PM +0100, Bruno Prémont wrote:
>> >Today I tried out 2.6.33-rc5 but always got crashing system:
>> >
>> >[ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
>> >[ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]
>>
>>
>> What does 'addr2line -e vmlinux 0xdf856b89' say?
>
>EIP is in module, so it will say nothing.

Oh, ok, then '-e net/8021q/vlan_dev.o'.

--
Live like a child, think like the god.

2010-01-24 13:43:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

Le 23/01/2010 22:31, Bruno Pr?mont a ?crit :
>> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
>> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
>> stats accounting)
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267
>
> Reverting just that commit gets the system running correctly.
>
> Bruno

I have no idea how this patch can break vlan networking.

Your disassembly and .config seems to show your machine is not SMP

Maybe something is broken on UP and alloc_percpu() ?

Could you add a debug in vlan_dev_init()


vlan_dev_info(dev)->vlan_rx_stats = alloc_percpu(struct vlan_rx_stats);
if (!vlan_dev_info(dev)->vlan_rx_stats)
return -ENOMEM;
+ pr_err("vlan_dev_init() rx_stats=%p\n", vlan_dev_info(dev)->vlan_rx_stats);


This make sure vlan_dev_init() is called and percpu allocation is properly done.

Thanks

2010-01-24 15:11:48

by Bruno Prémont

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sun, 24 January 2010 Eric Dumazet <[email protected]> wrote:
> Le 23/01/2010 22:31, Bruno Prémont a écrit :
> >> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
> >> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
> >> stats accounting)
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267
> >
> > Reverting just that commit gets the system running correctly.
> >
> > Bruno
>
> I have no idea how this patch can break vlan networking.
>
> Your disassembly and .config seems to show your machine is not SMP
>
> Maybe something is broken on UP and alloc_percpu() ?
>
> Could you add a debug in vlan_dev_init()
>
>
> vlan_dev_info(dev)->vlan_rx_stats = alloc_percpu(struct
> vlan_rx_stats); if (!vlan_dev_info(dev)->vlan_rx_stats)
> return -ENOMEM;
> + pr_err("vlan_dev_init() rx_stats=%p\n",
> vlan_dev_info(dev)->vlan_rx_stats);
>
>
> This make sure vlan_dev_init() is called and percpu allocation is
> properly done.
>
> Thanks

Readding your "precise RX stats" commit with following additional changes I get
(first hunk to avoid crash):

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b788978..9216a98 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -165,8 +165,11 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,

rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
smp_processor_id());
- rx_stats->rx_packets++;
- rx_stats->rx_bytes += skb->len;
+ if (rx_stats) {
+ rx_stats->rx_packets++;
+ rx_stats->rx_bytes += skb->len;
+ } else
+ pr_err("vlan_skb_recv() rx_stats=%p -> %p\n",
vlan_dev_info(dev)->vlan_rx_stats, rx_stats);
skb_pull_rcsum(skb, VLAN_HLEN);

@@ -738,6 +741,7 @@ static int vlan_dev_init(struct net_device *dev)
vlan_dev_info(dev)->vlan_rx_stats = alloc_percpu(struct
vlan_rx_stats); if (!vlan_dev_info(dev)->vlan_rx_stats)
return -ENOMEM;
+ pr_err("vlan_dev_init() rx_stats=%p\n",
vlan_dev_info(dev)->vlan_rx_stats);
return 0;
}


...
[ 13.877610] Ending clean XFS mount for filesystem: loop0
[ 15.795601] Adding 1004020k swap on /dev/sda5. Priority:-1 extents:1 across:1004020k
[ 16.612143] ADDRCONF(NETDEV_UP): lan: link is not ready
[ 16.851846] 802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
[ 16.851856] All bugs added by David S. Miller <[email protected]>
[ 16.878049] vlan_dev_init() rx_stats=dceae290
[ 20.040395] b44: lan: Link is up at 100 Mbps, full duplex.
[ 20.040404] b44: lan: Flow control is off for TX and off for RX.
[ 20.040535] ADDRCONF(NETDEV_CHANGE): lan: link becomes ready
[ 25.019941] RPC: Registered udp transport module.
[ 25.019950] RPC: Registered tcp transport module.
[ 25.019956] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 25.382400] svc: failed to register lockdv1 RPC service (errno 97).
[ 25.385278] mount.nfs used greatest stack depth: 1012 bytes left
[ 25.872973] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 26.740561] vlan_skb_recv() rx_stats=(null) -> (null)
[ 26.775071] vlan_skb_recv() rx_stats=(null) -> (null)
[ 27.050223] lan.658: no IPv6 routers present
[ 58.357052] vlan_skb_recv() rx_stats=(null) -> (null)
[ 63.350334] vlan_skb_recv() rx_stats=(null) -> (null)
[ 91.767589] vlan_skb_recv() rx_stats=(null) -> (null)
[ 91.932344] vlan_skb_recv() rx_stats=(null) -> (null)
[ 91.933053] vlan_skb_recv() rx_stats=(null) -> (null)
[ 91.998628] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.002752] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.007918] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.009644] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.016789] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.018520] vlan_skb_recv() rx_stats=(null) -> (null)
[ 92.041737] vlan_skb_recv() rx_stats=(null) -> (null)
...

So during vlan_dev_init() rx_stats is properly initialized, but when
packets reach the vlan dev later on exactly that same pointer is NULL...

So question is, who did reset vlan_dev_info(dev)->vlan_rx_stats to NULL?

Thanks,
Bruno

2010-01-24 15:26:16

by Bruno Prémont

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sun, 24 January 2010 Eric Dumazet <[email protected]> wrote:
> Le 23/01/2010 22:31, Bruno Prémont a écrit :
> >> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
> >> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
> >> stats accounting)
> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267
> >
> > Reverting just that commit gets the system running correctly.
> >
> > Bruno
>
> I have no idea how this patch can break vlan networking.
>
> Your disassembly and .config seems to show your machine is not SMP

Exact

> Maybe something is broken on UP and alloc_percpu() ?

Apparently not, see below and previous mail

> Could you add a debug in vlan_dev_init()

In addition to previous mail, I'm also dumping the result of
vlan_dev_info(dev) shows that the returned pointer is not the same
during vlan_dev_init() and vlan_skb_recv() ...

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b788978..f370ce1 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -165,8 +165,11 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,

rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
smp_processor_id());
- rx_stats->rx_packets++;
- rx_stats->rx_bytes += skb->len;
+ if (rx_stats) {
+ rx_stats->rx_packets++;
+ rx_stats->rx_bytes += skb->len;
+ } else
+ pr_err("vlan_skb_recv() %p->rx_stats=%p -> %p\n", vlan_dev_info(dev), vlan_dev_info(dev)->vlan_rx_stats, rx_stats);

skb_pull_rcsum(skb, VLAN_HLEN);

@@ -738,6 +741,7 @@ static int vlan_dev_init(struct net_device *dev)
vlan_dev_info(dev)->vlan_rx_stats = alloc_percpu(struct vlan_rx_stats);
if (!vlan_dev_info(dev)->vlan_rx_stats)
return -ENOMEM;
+ pr_err("vlan_dev_init() %p->rx_stats=%p\n", vlan_dev_info(dev), vlan_dev_info(dev)->vlan_rx_stats);

return 0;
}

This slightly adjusted change produces the following output:
...
[ 1192.257978] ADDRCONF(NETDEV_UP): lan: link is not ready
[ 1192.399059] 802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
[ 1192.399063] All bugs added by David S. Miller <[email protected]>
[ 1192.404475] vlan_dev_init() da4ff360->rx_stats=dbd5a340
^^^^^^^^
[ 1196.000225] b44: lan: Link is up at 100 Mbps, full duplex.
[ 1196.000234] b44: lan: Flow control is off for TX and off for RX.
[ 1196.000379] ADDRCONF(NETDEV_CHANGE): lan: link becomes ready
[ 1203.160226] lan.658: no IPv6 routers present
[ 1212.760561] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
^^^^^^^^
[ 1212.794961] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
[ 1219.247018] svc: failed to register lockdv1 RPC service (errno 97).
[ 1219.249919] mount.nfs used greatest stack depth: 1008 bytes left
[ 1221.388602] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
[ 1221.388690] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
[ 1278.793350] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
[ 1283.750363] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
...

This might explain the NULL rx_stats pointer, but why do there exist
two distinct vlan_dev_info(dev)? (unless in one case dev would be
the physical network device and in the other case it would be vlan device?
that is lan versus lan.658 in my case...)

Thanks,
Bruno

2010-01-24 16:23:19

by Cong Wang

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

On Sun, Jan 24, 2010 at 04:25:49PM +0100, Bruno Prémont wrote:
>On Sun, 24 January 2010 Eric Dumazet <[email protected]> wrote:
>> Le 23/01/2010 22:31, Bruno Prémont a écrit :
>> >> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
>> >> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
>> >> stats accounting)
>> >> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267
>> >
>> > Reverting just that commit gets the system running correctly.
>> >
>> > Bruno
>>
>> I have no idea how this patch can break vlan networking.
>>
>> Your disassembly and .config seems to show your machine is not SMP
>
>Exact
>
>> Maybe something is broken on UP and alloc_percpu() ?
>
>Apparently not, see below and previous mail
>
>> Could you add a debug in vlan_dev_init()
>
>In addition to previous mail, I'm also dumping the result of
>vlan_dev_info(dev) shows that the returned pointer is not the same
>during vlan_dev_init() and vlan_skb_recv() ...
>
>diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
>index b788978..f370ce1 100644
>--- a/net/8021q/vlan_dev.c
>+++ b/net/8021q/vlan_dev.c
>@@ -165,8 +165,11 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
>
> rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
> smp_processor_id());


I am thinking if vlan_dev_info(dev) here should be
vlan_dev_info(skb->dev)...



>- rx_stats->rx_packets++;
>- rx_stats->rx_bytes += skb->len;
>+ if (rx_stats) {
>+ rx_stats->rx_packets++;
>+ rx_stats->rx_bytes += skb->len;
>+ } else
>+ pr_err("vlan_skb_recv() %p->rx_stats=%p -> %p\n", vlan_dev_info(dev), vlan_dev_info(dev)->vlan_rx_stats, rx_stats);
>
> skb_pull_rcsum(skb, VLAN_HLEN);
>
>@@ -738,6 +741,7 @@ static int vlan_dev_init(struct net_device *dev)
> vlan_dev_info(dev)->vlan_rx_stats = alloc_percpu(struct vlan_rx_stats);
> if (!vlan_dev_info(dev)->vlan_rx_stats)
> return -ENOMEM;
>+ pr_err("vlan_dev_init() %p->rx_stats=%p\n", vlan_dev_info(dev), vlan_dev_info(dev)->vlan_rx_stats);
>
> return 0;
> }
>
>This slightly adjusted change produces the following output:
>...
>[ 1192.257978] ADDRCONF(NETDEV_UP): lan: link is not ready
>[ 1192.399059] 802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
>[ 1192.399063] All bugs added by David S. Miller <[email protected]>
>[ 1192.404475] vlan_dev_init() da4ff360->rx_stats=dbd5a340
> ^^^^^^^^
>[ 1196.000225] b44: lan: Link is up at 100 Mbps, full duplex.
>[ 1196.000234] b44: lan: Flow control is off for TX and off for RX.
>[ 1196.000379] ADDRCONF(NETDEV_CHANGE): lan: link becomes ready
>[ 1203.160226] lan.658: no IPv6 routers present
>[ 1212.760561] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
> ^^^^^^^^
>[ 1212.794961] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
>[ 1219.247018] svc: failed to register lockdv1 RPC service (errno 97).
>[ 1219.249919] mount.nfs used greatest stack depth: 1008 bytes left
>[ 1221.388602] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
>[ 1221.388690] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
>[ 1278.793350] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
>[ 1283.750363] vlan_skb_recv() ddbb8b60->rx_stats=(null) -> (null)
>...
>
>This might explain the NULL rx_stats pointer, but why do there exist
>two distinct vlan_dev_info(dev)? (unless in one case dev would be
>the physical network device and in the other case it would be vlan device?
>that is lan versus lan.658 in my case...)
>
>Thanks,
>Bruno
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/

--
Live like a child, think like the god.

2010-01-24 16:26:49

by Eric Dumazet

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

Le 24/01/2010 16:25, Bruno Prémont a écrit :
>
> In addition to previous mail, I'm also dumping the result of
> vlan_dev_info(dev) shows that the returned pointer is not the same
> during vlan_dev_init() and vlan_skb_recv() ...
>

>
> This might explain the NULL rx_stats pointer, but why do there exist
> two distinct vlan_dev_info(dev)? (unless in one case dev would be
> the physical network device and in the other case it would be vlan device?
> that is lan versus lan.658 in my case...)
>


Ok, I believe I found the problem :)

Please try following patch, if it works I'll submit with appropriate Changelog and signatures

Thanks


diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b788978..c1b92ca 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -163,7 +163,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
goto err_unlock;
}

- rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
+ rx_stats = per_cpu_ptr(vlan_dev_info(skb->dev)->vlan_rx_stats,
smp_processor_id());
rx_stats->rx_packets++;
rx_stats->rx_bytes += skb->len;

2010-01-24 16:27:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

Le 24/01/2010 17:25, Américo Wang a écrit :
> On Sun, Jan 24, 2010 at 04:25:49PM +0100, Bruno Prémont wrote:
>> On Sun, 24 January 2010 Eric Dumazet <[email protected]> wrote:
>>> Le 23/01/2010 22:31, Bruno Prémont a écrit :
>>>>> Above part of code did change between 2.6.32 and 2.6.33-rc5 with
>>>>> commit 9793241fe92f7d9303fb221e43fc598eb065f267 (vlan: Precise RX
>>>>> stats accounting)
>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9793241fe92f7d9303fb221e43fc598eb065f267
>>>>
>>>> Reverting just that commit gets the system running correctly.
>>>>
>>>> Bruno
>>>
>>> I have no idea how this patch can break vlan networking.
>>>
>>> Your disassembly and .config seems to show your machine is not SMP
>>
>> Exact
>>
>>> Maybe something is broken on UP and alloc_percpu() ?
>>
>> Apparently not, see below and previous mail
>>
>>> Could you add a debug in vlan_dev_init()
>>
>> In addition to previous mail, I'm also dumping the result of
>> vlan_dev_info(dev) shows that the returned pointer is not the same
>> during vlan_dev_init() and vlan_skb_recv() ...
>>
>> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
>> index b788978..f370ce1 100644
>> --- a/net/8021q/vlan_dev.c
>> +++ b/net/8021q/vlan_dev.c
>> @@ -165,8 +165,11 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
>>
>> rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
>> smp_processor_id());
>
>
> I am thinking if vlan_dev_info(dev) here should be
> vlan_dev_info(skb->dev)...
>
>

Yes, I came to same conclusion :)

Thanks

2010-01-24 19:18:39

by Bruno Prémont

[permalink] [raw]
Subject: Re: [2.6.33-rc5 regression] NULL pointer dereference in vlan_skb_recv - probably introduced by commit 9793241fe92f7d9303fb221e43fc598eb065f267

Yep, works fine and RX stats look right as well.

So
Reported-By: Bruno Prémont <[email protected]>
Tested-By: Bruno Prémont <[email protected]>

Thanks,
Bruno


On Sun, 24 January 2010 Eric Dumazet <[email protected]> wrote:
> Ok, I believe I found the problem :)
>
> Please try following patch, if it works I'll submit with appropriate
> Changelog and signatures
>
> Thanks
>
>
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index b788978..c1b92ca 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -163,7 +163,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
> goto err_unlock;
> }
>
> - rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
> + rx_stats = per_cpu_ptr(vlan_dev_info(skb->dev)->vlan_rx_stats,
> smp_processor_id());
> rx_stats->rx_packets++;
> rx_stats->rx_bytes += skb->len;

2010-01-24 19:43:14

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] vlan: fix vlan_skb_recv()

Le 24/01/2010 20:18, Bruno Prémont a écrit :
> Yep, works fine and RX stats look right as well.
>
> So
> Reported-By: Bruno Prémont <[email protected]>
> Tested-By: Bruno Prémont <[email protected]>
>
> Thanks,

Thanks very much Bruno, here is the official patch submission then.

[PATCH] vlan: fix vlan_skb_recv()

Bruno Prémont found commit 9793241fe92f7d930
(vlan: Precise RX stats accounting) added a regression for non
hw accelerated vlans.

[ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]

vlan_dev_info() was used with original device, instead of
skb->dev. Also spotted by Américo Wang.

Reported-By: Bruno Prémont <[email protected]>
Tested-By: Bruno Prémont <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
net/8021q/vlan_dev.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b788978..c1b92ca 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -163,7 +163,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
goto err_unlock;
}

- rx_stats = per_cpu_ptr(vlan_dev_info(dev)->vlan_rx_stats,
+ rx_stats = per_cpu_ptr(vlan_dev_info(skb->dev)->vlan_rx_stats,
smp_processor_id());
rx_stats->rx_packets++;
rx_stats->rx_bytes += skb->len;

2010-01-25 03:52:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] vlan: fix vlan_skb_recv()

From: Eric Dumazet <[email protected]>
Date: Sun, 24 Jan 2010 20:43:06 +0100

> [PATCH] vlan: fix vlan_skb_recv()
>
> Bruno Pr?mont found commit 9793241fe92f7d930
> (vlan: Precise RX stats accounting) added a regression for non
> hw accelerated vlans.
>
> [ 26.390576] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 26.396369] IP: [<df856b89>] vlan_skb_recv+0x89/0x280 [8021q]
>
> vlan_dev_info() was used with original device, instead of
> skb->dev. Also spotted by Am?rico Wang.
>
> Reported-By: Bruno Pr?mont <[email protected]>
> Tested-By: Bruno Pr?mont <[email protected]>
> Signed-off-by: Eric Dumazet <[email protected]>

Applied, thanks everyone.