2005-09-12 21:56:26

by Nuutti Kotivuori

[permalink] [raw]
Subject: netfilter QUEUE target and packet socket interactions buggy or not

I am in the process of debugging a kernel panic manifested on a Red
Hat Enterprise Linux 4 under rather difficult conditions. While
investigating this, I came upon a few bits of code that I'd like some
clarification on. However, I will start by describing the problem.

I am getting a consistent kernel panic under specific high load, which
involves heavy use of the netfilter QUEUE target and packet filter. I
will paraphrase the important parts of the backtrace here:

,----
| Unable to handle kernel NULL pointer dereference at virtual address 00000018
| ...
| __kfree_skb+0xf4/0xf7
| [<c02c3188>] packet_rcv+0x2ca/0x2d4
| [<f888f792>] bcm5700_start_xmit+0x477/0x4a5 [bcm5700]
| [<c01a3a02>] selinux_ipv4_postroute_last+0xf/0x13
| ...
| [<c028cf66>] dst_output+0xf/0x1a
| [<c027cfdb>] nf_reinject+0x14d/0x1a9
| [<f894101e>] ipq_issue_verdict+0x1e/0x2b [ip_queue]
| ...
| [<c028592c>] netlink_sendmsg+0x254/0x263
| [<c011dcf5>] __wake_up+0x29/0x3c
| [<c026b92d>] sock_sendmsg+0xdb/0xf7
| ...
| [<c0133b04>] unqueue_me+0x73/0x79
| [<c011dcf5>] __wake_up+0x29/0x3c
| [<c026d465>] sys_socketcall+0x1c1/0x1dd
| [<c0125351>] sys_gettimeofday+0x53/0xac
| [<c02c7377>] syscall_call+0x7/0xb
`----

So what I gather is happening here is that we are in syscall context,
inside that the nf_reinject stuff puts the queued packet decision
received from userspace onwards and it ends up being captured by a
packet socket. And for some reason, the packet ends up being
kfree_skb'd twice.

Two things caught my attention. First of all, there was a relatively
recent fix to ip_queue which had to do with the calling context. I
will copy the rationale here:

,----[ Harald Welte <laforge at netfilter.org> ]
| [NETFILTER]: Fix deadlock with ip_queue and tcp local input path.
|
| When we have ip_queue being used from LOCAL_IN, then we end up with a
| situation where the verdicts coming back from userspace traverse the TCP
| input path from syscall context. While this seems to work most of the
| time, there's an ugly deadlock:
|
| syscall context is interrupted by the timer interrupt. When the timer
| interrupt leaves, the timer softirq get's scheduled and calls
| tcp_delack_timer() and alike. They themselves do bh_lock_sock(sk),
| which is already held from somewhere else -> boom.
|
| I've now tested the suggested solution by Patrick McHardy and Herbert
| Xu to simply use local_bh_{en,dis}able().
`----

Second, I went looking at the packet socket code and found this
comment:

,----
| This function makes lazy skb cloning in hope that most of packets
| are discarded by BPF.
|
| Note tricky part: we DO mangle shared skb! skb->data, skb->len
| and skb->cb are mangled. It works because (and until) packets
| falling here are owned by current CPU. Output packets are cloned
| by dev_queue_xmit_nit(), input packets are processed by net_bh
| sequencially, so that if we return skb to original state on exit,
| we will not harm anyone.
`----

But are those assumptions valid in the obscure case of us being in the
syscall context, receiving a queued packet from userspace? In any
case, by looking at the disassembly and at the stacktrace, it seems
that the incoming skb is not shared and gets dropped by one of the
goto clauses. The crashing call is the kfree_skb at the very end of
the af_packet.c:packet_rcv function.

I am putting this mail here as a heads up if someone manages to
instantly spot what's wrong with this setup. I will continue debugging
the real cause, and eliminating all the possible variables, seeing
whether this is an SMP problem, checking if it can be manifested with
a vanilla kernel and such.

A detailed dump of the crash can be found at:

https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=118541

Warm fuzzies,
-- Naked


2005-09-12 22:11:16

by David Miller

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

From: Nuutti Kotivuori <[email protected]>
Date: Tue, 13 Sep 2005 01:12:26 +0300

> ,----
> | Unable to handle kernel NULL pointer dereference at virtual address 00000018
> | ...
> | __kfree_skb+0xf4/0xf7
> | [<c02c3188>] packet_rcv+0x2ca/0x2d4
> | [<f888f792>] bcm5700_start_xmit+0x477/0x4a5 [bcm5700]

Please use the tg3 driver that actually comes with
the kernel :-)

2005-09-12 22:15:05

by Nuutti Kotivuori

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

David S. Miller wrote:
> From: Nuutti Kotivuori <[email protected]>
> Date: Tue, 13 Sep 2005 01:12:26 +0300
>
>> ,----
>> | Unable to handle kernel NULL pointer dereference at virtual address 00000018
>> | ...
>> | __kfree_skb+0xf4/0xf7
>> | [<c02c3188>] packet_rcv+0x2ca/0x2d4
>> | [<f888f792>] bcm5700_start_xmit+0x477/0x4a5 [bcm5700]
>
> Please use the tg3 driver that actually comes with
> the kernel :-)

The problem also appears with the tg3 driver. In some other crashes,
the traceback looks like:

,----
| __kfree_skb+0xf4/0xf7
| [<c02c3188>] packet_rcv+0x2ca/0x2d4
| [<c0273ca8>] dev_queue_xmit_nit+0xc1/0xd3
| [<c01a3a02>] selinux_ipv4_postroute_last+0xf/0x13
`----

I am doubtful the network card driver would be at fault here, but
that'll be confirmed once I manage to narrow this down more.

-- Naked

2005-09-13 10:33:03

by Nuutti Kotivuori

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

Nuutti Kotivuori wrote:
> David S. Miller wrote:
>> Please use the tg3 driver that actually comes with
>> the kernel :-)

[...]

> I am doubtful the network card driver would be at fault here, but
> that'll be confirmed once I manage to narrow this down more.

Appended here is a backtrace with the tg3 driver. Also, it seems that
the bug cannot be reproduced with uniprocessor, only SMP.

Unable to handle kernel NULL pointer dereference at virtual address 00000018
printing eip:
c01a387f
*pde = 36ee0001
Oops: 0000 [#1]
SMP
Modules linked in: arpt_mangle arptable_filter arp_tables iptable_filter ip_tables ip_queue parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc dm_mod button battery acEIP is at selinux_ip_postroute_last+0x6a/0x1de
eax: 00000000 ebx: 00000000 ecx: f7a91bb0 edx: 00000003
esi: efb8f080 edi: c0455780 ebp: 00000004 esp: f7a91b8c
ds: 007b es: 007b ss: 0068
Process dispatcher (pid: 2748, threadinfo=f7a91000 task=f31ed830)
Stack: 00000000 e9011280 00000000 e9bfdb80 00000002 f88a965a 01ade49e 00000000
00000206 000000f5 f88a983c c026f163 f6878680 f7538b68 c02c3188 000000c7
00000042 00000206 [<f88a965a>] tg3_start_xmit+0x27e/0x476 [tg3]
[<f88a983c>] tg3_start_xmit+0x460/0x476 [tg3] __kfree_skb+0xf4/0xf7
[<c02c3188>] packet_rcv+0x2ca/0x2d4
[<c011d270>] find_busiest_group+0xf1/0x2e0
[<c01a3a02>] selinux_ipv4_postroute_last+0xf/0x13
[<c028d11f>] ip_finish_output2+0x0/0x16d
[<c027cb23>] nf_iterate+0x40/0x81
[<c028d11f>] ip_finish_output2+0x0/0x16d
[<c027ce21>] nf_hook_slow+0x47/0xb4
[<c028d11f>] ip_finish_output2+0x0/0x16d
[<c028d116>] ip_finish_output+0x1a5/0x1ae
[<c028d11f>] ip_finish_output2+0x0/0x16d
[<c028cf66>] dst_output+0xf/0x1a
[<c027cfdb>] nf_reinject+0x14d/0x1a9
[<f891401e>] ipq_issue_verdict+0x1e/0x2b [ip_queue]
[<f8914676>] ipq_set_verdict+0x53/0x5a [ip_queue]
[<f891472c>] ipq_receive_peer+0x3d/0x46 [ip_queue]
[<f891487d>] ipq_rcv_sk+0xfc/0x175 [ip_queue]
[<c0285b11>] netlink_data_ready+0x14/0x44
[<c028525b>] netlink_sendskb+0x52/0x6c
[<c028592c>] netlink_sendmsg+0x254/0x263
[<c011dcf5>] __wake_up+0x29/0x3c
[<c026b92d>] sock_sendmsg+0xdb/0xf7
[<c0285ae9>] netlink_recvmsg+0x1ae/0x1c2
[<c026ba64>] sock_recvmsg+0xef/0x10c
[<c011f6ee>] autoremove_wake_function+0x0/0x2d
[<c02709ba>] verify_iovec+0x76/0xc2
[<c026d07c>] sys_sendmsg+0x1ee/0x23b
[<c026b4fe>] move_addr_to_user+0x67/0x7f
[<c02c7d56>] reschedule_interrupt+0x1a/0x20
[<c01116de>] sched_clock+0x46/0x73
[<c011caf1>] finish_task_switch+0x30/0x66
[<c02c5604>] schedule+0x844/0x87a
[<c026d465>] sys_socketcall+0x1c1/0x1dd
[<c0125351>] sys_gettimeofday+0x53/0xac
[<c02c7377>] syscall_call+0x7/0xb
[<c02c007b>] unix_release_sock+0x15a/0x201
Code: 89 d3 83 c3 2c 0f 84 8c 01 00 00 8b 44 24 7c 31 c9 8d 54 24 24 e8 df 29 00 00 85 c0 0f 85

-- Naked

2005-09-13 16:33:53

by Patrick McHardy

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

Nuutti Kotivuori wrote:
>
> Appended here is a backtrace with the tg3 driver. Also, it seems that
> the bug cannot be reproduced with uniprocessor, only SMP.
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000018

This means inode->i_security was NULL. AFAICT it is only set to NULL in
inode_free_security() when the inode is freed. This shouldn't happen
while the packet is queued since the skb should hold a reference to
the socket on the output path. So it could be some protocol forgetting
to increase the refcnt when taking a reference. What kind of packet
is this? And what kernel version are you running? Until recently
ip_conntrack did some fiddling with skb->sk which could lead to
a packet on the output path with skb->sk set but no reference taken.

2005-09-13 17:57:24

by Nuutti Kotivuori

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

Patrick McHardy wrote:
> Nuutti Kotivuori wrote:
>>
>> Appended here is a backtrace with the tg3 driver. Also, it seems that
>> the bug cannot be reproduced with uniprocessor, only SMP.
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 00000018
>
> This means inode->i_security was NULL. AFAICT it is only set to NULL in
> inode_free_security() when the inode is freed. This shouldn't happen
> while the packet is queued since the skb should hold a reference to
> the socket on the output path. So it could be some protocol forgetting
> to increase the refcnt when taking a reference.

Right.

> What kind of packet is this? And what kernel version are you
> running? Until recently ip_conntrack did some fiddling with skb->sk
> which could lead to a packet on the output path with skb->sk set but
> no reference taken.

This happens on Red Hat Enterprise Linux 4, with a 2.6.9 kernel (with
a gazillion of Red Hat patches in it, latest ones being from 2.6.11)
and the ip_queue patch that adds the bottom-half disabling. I will
know for sure tomorrow, but it seems that it doesn't appear on vanilla
2.6.13.1 or without SMP.

It is very hard to know which packet specifically triggers this. The
machine is under heavy load in general, a lot of packets are handled
via a QUEUE target, and some packets are captured via packet socket.

I will post more details tomorrow, but if you could point me towards
the changes in ip_conntrack that affected this, it would be very
helpful. I could check if they are in the Red Hat kernel and if not,
patch them manually and see if it makes a difference. The problem is
now reproduciable in a couple hours, so it shouldn't be too hard.

Thanks,
-- Naked

2005-09-14 10:53:30

by Nuutti Kotivuori

[permalink] [raw]
Subject: Re: netfilter QUEUE target and packet socket interactions buggy or not

Nuutti Kotivuori wrote:
> I am in the process of debugging a kernel panic manifested on a Red
> Hat Enterprise Linux 4 under rather difficult conditions. While
> investigating this, I came upon a few bits of code that I'd like some
> clarification on. However, I will start by describing the problem.

Just as a heads up, I have now confirmed that the problem does not
happen on vanilla 2.6.13.1. The config is a bit different, but mostly
the same - although I had to disable SELinux due to other reasons, so
that might still be the culprit. Or the bug may have been fixed, or
never existed, in the mainline kernel.

-- Naked