I am sorry if this posts in HTML.
Some background information on my system.
Tyan S1863D (Tomcat IIID) motherboard, dual socket 7.
128mb 60ns EDO ram (8x16mb)
Dual Pentium 200mhz processors, non-mmx.
PIIX3 IDE Controller, onboard
Single 1.2gb drive on each of the two channels.
Advansys ISA SCSI Controller
Seagate SCSI HD, 9.1gb (Ultra Wide? not sure)
NE2000 Compatible ISA 10mbit ethernet nic IRQ 11 IO 0x300
Kingston KNE100TX PCI 100mbit fast ethernet nic, (uses tulip.o) IRQ 12 IO
0x6400
I am running a custom built linux system using linuxfromscratch, everything
except the kernel was compiled -O3 -march=i586. I am running kernel 2.4.6,
glibc 2.2.1, gcc 2.95.3, iptables 1.2.2. The NE2000 nic connects to my DSL
modem (swbell is my ISP) and the KNE100TX connects to my 100mbit switched
lan. The first IDE drive contains a small ext2 /boot partition, a swap
partition, and a linux raid autodetect partition. The second IDE drive has
a swap partition and a linux raid autodetect partition exactly the same size
as the partition on the first drive. These two partitions are configured
for software RAID0, with a reiserfs filesystem and mounted on /. The scsi
drive has a reiserfs and a swap partition. The reiserfs partition is mounted
on /home and exported to other linux systems on the lan via knfsd, using
nfsv3 support. There are no other directories or filesystems exported. The
kernel has SMP, software raid0, iptables, pppoe, nfs, reiserfs, ext2, and
all drivers except floppy.o compiled in. My .config file I use for 2.4.6
can be found here: http://saai.dyn.dhs.org/kernel-2.4.6.config. As far as
software running on the system, I have named, pppd (the patched version for
pppoe support), rpc.portmap, rpc.nfsd, rpc.mountd, apache, and mysql along
with telnet, rsh and proftpd running through inetd. As far as usage of the
system goes, I have its /home export mounted as /home on my other two linux
systems (all linuxfromscratch, same version software), I play mp3's residing
on the /home mount over nfs in linux, and via Apache using Winamp in
windows. This system acts as router for my lan, connected to an ADSL modem.
My problem is, that I seem to be having "random" kernel panics. When I say
random I mean that the system will run 30 minutes before it panics, or as
long as a week before it panics. System load and type of load does not seem
to matter, the system will panic in the middle of the night when there is
little/no cpu/network activity, or during very heavy multiuser cpu and
network usage -- it does not seem to matter. Always, the panic is pretty
much the same, occuring in the Process swapper. I believe that the problem
is related to iptables. I have a windows 98SE machine on my lan, that
accesses the internet though this linux system acting as a router. If I try
to connect to a server using Napster/Napigator on the windows box through
the linux router box I am having problems with, the system always panics.
When I used a 2.4.4 kernel with the reiserfs-nfs patch the system would
panic when I tried to check my email using Outlook Express 5.0 on the
windows box on my lan. Outlook no longer causes a panic, but rather
Napster/Napigator.
Below is the panic output, captured by setting the system console to a
serial port and using hyperterm on my windows box via a 115200kbps serial
link:
Unable to handle kernel paging request at virtual address 0000d7c5
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c01d6fc3>]
EFLAGS: 00010206
eax: c166ae00 ebx: 0000d7c5 ecx: 00000000 edx: 0000d7c5
esi: c7b28b40 edi: c166aba0 ebp: 00000060 esp: c1253d8c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c1253000)
Stack: fffffe00 c01d706b c7b28b40 fffffe00 c7b28b40 c01d7653 c7b28b40
c12ad800
c7b28b40 0000000e c7b28b40 ffffffe6 c01da0f7 c7b28b40 00000020
00000004
c7058f40 0000000e c01ddfed c7b28b40 00000001 00000000 c7b28b40
c01e7ea0
Call Trace: [<c01d706b>] [<c01d7653>] [<c01da0f7>] [<c01ddfed>] [<c01e7ea0>]
[<c01e7f60>] [<c01df228>]
[<c01e5450>] [<c01e7e83>] [<c01e7ea0>] [<c01e54aa>] [<c01df228>]
[<c01e05dc>] [<c01e53e4>] [<c01e5450>]
[<c01e4488>] [<c01e462a>] [<c01e4488>] [<c01df228>] [<c01e42c7>]
[<c01e4488>] [<c01da8ce>] [<c0116c8a>]
[<c0108680>] [<c0105180>] [<c0106d40>] [<c0105180>] [<c01051ac>]
[<c0105212>] [<c0108680>] [<c0220d4a>]
[<c019d05f>]
Code: 8b 1b 8b 42 70 83 f8 01 74 0b f0 ff 4a 70 0f 94 c0 84 c0 74
Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
Like I stated before, it is always the Process swapper. This panic was
caused by using Napster/Napigator, as opposed to a seemingly random panic.
I took the above panic output and pasted it into ksymoops:
>>EIP; c01d6fc3 <skb_drop_fraglist+17/3c> <=====
Trace; c01d706b <skb_release_data+5f/74>
Trace; c01d7653 <skb_linearize+cf/130>
Trace; c01da0f7 <dev_queue_xmit+27/2e8>
Trace; c01ddfed <neigh_resolve_output+1cd/240>
Trace; c01e7ea0 <ip_finish_output2+0/f8>
Trace; c01e7f60 <ip_finish_output2+c0/f8>
Trace; c01df228 <nf_hook_slow+110/194>
Trace; c01e5450 <ip_forward_finish+0/60>
Trace; c01e7e83 <ip_finish_output+127/130>
Trace; c01e7ea0 <ip_finish_output2+0/f8>
Trace; c01e54aa <ip_forward_finish+5a/60>
Trace; c01df228 <nf_hook_slow+110/194>
Trace; c01e05dc <rt_check_expire__thr+154/184>
Trace; c01e53e4 <ip_forward+1b4/220>
Trace; c01e5450 <ip_forward_finish+0/60>
Trace; c01e4488 <ip_rcv_finish+0/1e8>
Trace; c01e462a <ip_rcv_finish+1a2/1e8>
Trace; c01e4488 <ip_rcv_finish+0/1e8>
Trace; c01df228 <nf_hook_slow+110/194>
Trace; c01e42c7 <ip_rcv+367/3b0>
Trace; c01e4488 <ip_rcv_finish+0/1e8>
Trace; c01da8ce <net_rx_action+16a/274>
Trace; c0116c8a <do_softirq+5a/88>
Trace; c0108680 <do_IRQ+e0/f0>
Trace; c0105180 <default_idle+0/34>
Trace; c0106d40 <ret_from_intr+0/7>
Trace; c0105180 <default_idle+0/34>
Trace; c01051ac <default_idle+2c/34>
Trace; c0105212 <cpu_idle+3e/54>
Trace; c0108680 <do_IRQ+e0/f0>
Trace; c0220d4a <vsprintf+31e/34c>
Trace; c019d05f <serial_console_write+2b/194>
Code; c01d6fc3 <skb_drop_fraglist+17/3c>
00000000 <_EIP>:
Code; c01d6fc3 <skb_drop_fraglist+17/3c> <=====
0: 8b 1b mov (%ebx),%ebx <=====
Code; c01d6fc5 <skb_drop_fraglist+19/3c>
2: 8b 42 70 mov 0x70(%edx),%eax
Code; c01d6fc8 <skb_drop_fraglist+1c/3c>
5: 83 f8 01 cmp $0x1,%eax
Code; c01d6fcb <skb_drop_fraglist+1f/3c>
8: 74 0b je 15 <_EIP+0x15> c01d6fd8
<skb_drop_fraglist+2c/3c>
Code; c01d6fcd <skb_drop_fraglist+21/3c>
a: f0 ff 4a 70 lock decl 0x70(%edx)
Code; c01d6fd1 <skb_drop_fraglist+25/3c>
e: 0f 94 c0 sete %al
Code; c01d6fd4 <skb_drop_fraglist+28/3c>
11: 84 c0 test %al,%al
Code; c01d6fd6 <skb_drop_fraglist+2a/3c>
13: 74 00 je 15 <_EIP+0x15> c01d6fd8
<skb_drop_fraglist+2c/3c>
Kernel panic: Aiee, killing interrupt handler!
Which, by my limited understanding, seems to indicate iptables/softirq or
smp. My obvious question is, is this a bug? Is there a fix for it? I have
had this problem since kernel 2.4.3, when I first set this machine up as a
router. Below are some links with more information that might be useful,
and if theres anything else I can provide please ask.
dmesg output from bootup: http://saai.dyn.dhs.org/dmesg.txt
ifconfig & route output and iptables config:
http://saai.dyn.dhs.org/network.txt
kernel 2.4.6 .config file: http://saai.dyn.dhs.org/kernel-2.4.6.config
panic and ksymoops output: http://saai.dyn.dhs.org/panic.txt
packages installed on my linux systems: http://saai.dyn.dhs.org/packages/
(In case you would like to know version numbers of any other software I
have)
Andrew Friedley writes:
> >>EIP; c01d6fc3 <skb_drop_fraglist+17/3c> <=====
> Trace; c01d706b <skb_release_data+5f/74>
> Trace; c01d7653 <skb_linearize+cf/130>
> Trace; c01da0f7 <dev_queue_xmit+27/2e8>
> Trace; c01ddfed <neigh_resolve_output+1cd/240>
This report has been plagueing us for a month or two now, from
different people. But we hadn't been able to track it down.
Initially we believed it might be some obscure bug in netfilter
which got triggered more easily when the zerocopy stuff went in.
But all of our code audits turned up nothing.
Then I began to notice that "pppoe" was showing up in all the reports
where the user actually bothered to mention what net devices the
machine was using when it crashed.
I spent the past few days auditing the driver and I think I figured
out what was causing the OOPS (along with some other bugs I found
along the way).
I have no way to test out these changes, so can folks do that for me?
If I didn't screw something else up, then I'm pretty sure the OOPS
will go away. Let me know if something goes wrong due to these
changes.
Thanks.
--- drivers/net/pppoe.c.~1~ Mon Jul 9 17:52:06 2001
+++ drivers/net/pppoe.c Tue Jul 17 20:46:24 2001
@@ -5,7 +5,7 @@
* PPPoE --- PPP over Ethernet (RFC 2516)
*
*
- * Version: 0.6.6
+ * Version: 0.6.7
*
* 030700 : Fixed connect logic to allow for disconnect.
* 270700 : Fixed potential SMP problems; we must protect against
@@ -20,10 +20,16 @@
* 111100 : Fix recvmsg.
* 050101 : Fix PADT procesing.
* 140501 : Use pppoe_rcv_core to handle all backlog. (Alexey)
+ * 170701 : Do not lock_sock with rwlock held. (DaveM)
+ * Ignore discovery frames if user has socket
+ * locked. (DaveM)
+ * Ignore return value of dev_queue_xmit in __pppoe_xmit
+ * or else we may kfree an SKB twice. (DaveM)
*
* Author: Michal Ostrowski <[email protected]>
* Contributors:
* Arnaldo Carvalho de Melo <[email protected]>
+ * David S. Miller ([email protected])
*
* License:
* This program is free software; you can redistribute it and/or
@@ -100,10 +106,11 @@
static int hash_item(unsigned long sid, unsigned char *addr)
{
- char hash=0;
- int i,j;
- for (i = 0; i < ETH_ALEN ; ++i){
- for (j = 0; j < 8/PPPOE_HASH_BITS ; ++j){
+ char hash = 0;
+ int i, j;
+
+ for (i = 0; i < ETH_ALEN ; ++i) {
+ for (j = 0; j < 8/PPPOE_HASH_BITS ; ++j) {
hash ^= addr[i] >> ( j * PPPOE_HASH_BITS );
}
}
@@ -188,7 +195,7 @@
read_lock_bh(&pppoe_hash_lock);
po = __get_item(sid, addr);
- if(po)
+ if (po)
sock_hold(po->sk);
read_unlock_bh(&pppoe_hash_lock);
@@ -233,62 +240,83 @@
* Certain device events require that sockets be unconnected.
*
**************************************************************************/
+
+static void pppoe_flush_dev(struct net_device *dev)
+{
+ int hash;
+
+ if (dev == NULL)
+ BUG();
+
+ read_lock_bh(&pppoe_hash_lock);
+ for (hash = 0; hash < PPPOE_HASH_SIZE; hash++) {
+ struct pppox_opt *po = item_hash_table[hash];
+
+ while (po != NULL) {
+ if (po->pppoe_dev == dev) {
+ struct sock *sk = po->sk;
+
+ sock_hold(sk);
+ po->pppoe_dev = NULL;
+
+ /* We hold a reference to SK, now drop the
+ * hash table lock so that we may attempt
+ * to lock the socket (which can sleep).
+ */
+ read_unlock_bh(&pppoe_hash_lock);
+
+ lock_sock(sk);
+
+ if (sk->state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+ pppox_unbind_sock(sk);
+ dev_put(dev);
+ sk->state = PPPOX_DEAD;
+ sk->state_change(sk);
+ }
+
+ release_sock(sk);
+
+ sock_put(sk);
+
+ read_lock_bh(&pppoe_hash_lock);
+
+ /* Now restart from the beginning of this
+ * hash chain. We always NULL out pppoe_dev
+ * so we are guarenteed to make forward
+ * progress.
+ */
+ po = item_hash_table[hash];
+ continue;
+ }
+ po = po->next;
+ }
+ }
+ read_unlock_bh(&pppoe_hash_lock);
+}
+
static int pppoe_device_event(struct notifier_block *this,
unsigned long event, void *ptr)
{
- int error = NOTIFY_DONE;
struct net_device *dev = (struct net_device *) ptr;
- struct pppox_opt *po = NULL;
- int hash = 0;
/* Only look at sockets that are using this specific device. */
switch (event) {
case NETDEV_CHANGEMTU:
- /* A change in mtu is a bad thing, requiring
- * LCP re-negotiation.
- */
+ /* A change in mtu is a bad thing, requiring
+ * LCP re-negotiation.
+ */
+
case NETDEV_GOING_DOWN:
case NETDEV_DOWN:
-
/* Find every socket on this device and kill it. */
- read_lock_bh(&pppoe_hash_lock);
-
- while (!po && hash < PPPOE_HASH_SIZE){
- po = item_hash_table[hash];
- ++hash;
- }
-
- while (po && hash < PPPOE_HASH_SIZE){
- if(po->pppoe_dev == dev){
- lock_sock(po->sk);
- if (po->sk->state & (PPPOX_CONNECTED|PPPOX_BOUND)){
- pppox_unbind_sock(po->sk);
-
- dev_put(po->pppoe_dev);
- po->pppoe_dev = NULL;
-
- po->sk->state = PPPOX_DEAD;
- po->sk->state_change(po->sk);
- }
- release_sock(po->sk);
- }
- if (po->next) {
- po = po->next;
- } else {
- po = NULL;
- while (!po && hash < PPPOE_HASH_SIZE){
- po = item_hash_table[hash];
- ++hash;
- }
- }
- }
- read_unlock_bh(&pppoe_hash_lock);
+ pppoe_flush_dev(dev);
break;
+
default:
break;
};
- return error;
+ return NOTIFY_DONE;
}
@@ -304,40 +332,39 @@
* Do the real work of receiving a PPPoE Session frame.
*
***********************************************************************/
-int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb){
- struct pppox_opt *po=sk->protinfo.pppox;
+int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
+{
+ struct pppox_opt *po = sk->protinfo.pppox;
struct pppox_opt *relay_po = NULL;
if (sk->state & PPPOX_BOUND) {
skb_pull(skb, sizeof(struct pppoe_hdr));
-
ppp_input(&po->chan, skb);
- } else if( sk->state & PPPOX_RELAY ){
-
- relay_po = get_item_by_addr( &po->pppoe_relay );
+ } else if (sk->state & PPPOX_RELAY) {
+ relay_po = get_item_by_addr(&po->pppoe_relay);
- if( relay_po == NULL ||
- !( relay_po->sk->state & PPPOX_CONNECTED ) ){
- goto abort;
- }
+ if (relay_po == NULL)
+ goto abort_kfree;
+
+ if ((relay_po->sk->state & PPPOX_CONNECTED) == 0)
+ goto abort_put;
skb_pull(skb, sizeof(struct pppoe_hdr));
- if( !__pppoe_xmit( relay_po->sk , skb) ){
- goto abort;
- }
+ if (!__pppoe_xmit( relay_po->sk , skb))
+ goto abort_put;
} else {
sock_queue_rcv_skb(sk, skb);
}
- return 1;
-abort:
- if(relay_po)
- sock_put(relay_po->sk);
- return 0;
-
-}
+ return NET_RX_SUCCESS;
+abort_put:
+ sock_put(relay_po->sk);
+abort_kfree:
+ kfree_skb(skb);
+ return NET_RX_DROP;
+}
/************************************************************************
*
@@ -356,24 +383,25 @@
po = get_item((unsigned long) ph->sid, skb->mac.ethernet->h_source);
- if(!po){
+ if (!po) {
kfree_skb(skb);
- return 0;
+ return NET_RX_DROP;
}
sk = po->sk;
bh_lock_sock(sk);
/* Socket state is unknown, must put skb into backlog. */
- if( sk->lock.users != 0 ){
- sk_add_backlog( sk, skb);
- ret = 1;
- }else{
+ if (sk->lock.users != 0) {
+ sk_add_backlog(sk, skb);
+ ret = NET_RX_SUCCESS;
+ } else {
ret = pppoe_rcv_core(sk, skb);
}
bh_unlock_sock(sk);
sock_put(sk);
+
return ret;
}
@@ -390,24 +418,31 @@
{
struct pppoe_hdr *ph = (struct pppoe_hdr *) skb->nh.raw;
struct pppox_opt *po;
- struct sock *sk = NULL;
if (ph->code != PADT_CODE)
goto abort;
po = get_item((unsigned long) ph->sid, skb->mac.ethernet->h_source);
+ if (po) {
+ struct sock *sk = po->sk;
- if (!po)
- goto abort;
+ bh_lock_sock(sk);
- sk = po->sk;
+ /* If the user has locked the socket, just ignore
+ * the packet. With the way two rcv protocols hook into
+ * one socket family type, we cannot (easily) distinguish
+ * what kind of SKB it is during backlog rcv.
+ */
+ if (sk->lock.users == 0)
+ pppox_unbind_sock(sk);
- pppox_unbind_sock(sk);
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ }
- sock_put(sk);
- abort:
+abort:
kfree_skb(skb);
- return 0;
+ return NET_RX_SUCCESS; /* Lies... :-) */
}
struct packet_type pppoes_ptype = {
@@ -524,7 +559,7 @@
struct sock *sk = sock->sk;
struct net_device *dev = NULL;
struct sockaddr_pppox *sp = (struct sockaddr_pppox *) uservaddr;
- struct pppox_opt *po=sk->protinfo.pppox;
+ struct pppox_opt *po = sk->protinfo.pppox;
int error;
lock_sock(sk);
@@ -569,8 +604,9 @@
po->pppoe_dev = dev;
- if( ! (dev->flags & IFF_UP) )
+ if (!(dev->flags & IFF_UP))
goto err_put;
+
memcpy(&po->pppoe_pa,
&sp->sa_addr.pppoe,
sizeof(struct pppoe_addr));
@@ -687,7 +723,7 @@
/* PPPoE address from the user specifies an outbound
PPPoE address to which frames are forwarded to */
err = -EFAULT;
- if( copy_from_user(&po->pppoe_relay,
+ if (copy_from_user(&po->pppoe_relay,
(void*)arg,
sizeof(struct sockaddr_pppox)))
break;
@@ -752,7 +788,7 @@
dev = sk->protinfo.pppox->pppoe_dev;
error = -EMSGSIZE;
- if(total_len > dev->mtu+dev->hard_header_len)
+ if (total_len > (dev->mtu + dev->hard_header_len))
goto end;
@@ -775,7 +811,7 @@
ph = (struct pppoe_hdr *) skb_put(skb, total_len + sizeof(struct pppoe_hdr));
start = (char *) &ph->tag[0];
- error = memcpy_fromiovec( start, m->msg_iov, total_len);
+ error = memcpy_fromiovec(start, m->msg_iov, total_len);
if (error < 0) {
kfree_skb(skb);
@@ -793,7 +829,7 @@
dev_queue_xmit(skb);
- end:
+end:
release_sock(sk);
return error;
}
@@ -811,9 +847,8 @@
int headroom = skb_headroom(skb);
int data_len = skb->len;
- if (sk->dead || !(sk->state & PPPOX_CONNECTED)) {
+ if (sk->dead || !(sk->state & PPPOX_CONNECTED))
goto abort;
- }
hdr.ver = 1;
hdr.type = 1;
@@ -821,9 +856,8 @@
hdr.sid = sk->num;
hdr.length = htons(skb->len);
- if (!dev) {
+ if (!dev)
goto abort;
- }
/* Copy the skb if there is no space for the header. */
if (headroom < (sizeof(struct pppoe_hdr) + dev->hard_header_len)) {
@@ -844,6 +878,12 @@
skb = skb2;
}
+ /* We may not return error beyond this point. Potentially this
+ * is a new SKB and in such a case we've already freed the
+ * original SKB. So if we were to return error, our caller would
+ * double free that original SKB. -DaveM
+ */
+
ph = (struct pppoe_hdr *) skb_push(skb, sizeof(struct pppoe_hdr));
memcpy(ph, &hdr, sizeof(struct pppoe_hdr));
skb->protocol = __constant_htons(ETH_P_PPP_SES);
@@ -856,11 +896,10 @@
sk->protinfo.pppox->pppoe_pa.remote,
NULL, data_len);
- if (dev_queue_xmit(skb) < 0)
- goto abort;
-
+ dev_queue_xmit(skb);
return 1;
- abort:
+
+abort:
return 0;
}
"David S. Miller" <[email protected]> writes:
>
> This report has been plagueing us for a month or two now, from
> different people. But we hadn't been able to track it down.
>
> Initially we believed it might be some obscure bug in netfilter
> which got triggered more easily when the zerocopy stuff went in.
> But all of our code audits turned up nothing.
This brings up an interesting point.
One of the initial issues in developing PPPoE support was how to
ensure that the layers passing packets to PPPoE allocated the correct
amount of header space in the skb (so as to avoid a copy of the skb to
create space for PPPoE headers).
This issue was resolved by having the PPP layer respect the header
lengths specified by the underlying transport layers (PPPoE) when
defining dev->hard_header_len. However, just to be on the safe side,
this code that copied the packet was left in place.
My guess is that before zerocopy netfilter, netfilter made an skb that
conformed to the header requirements of PPPoE. Once netfilter stopped
making copies PPPoE was passed skb's without space for PPPoE headers
and thus invoked the code path you've just fixed.
Is it possible for netfilter to pass to the PPP device a packet that
respect's the PPP device's hard_header_len? (This would avoid the copy
in PPPoE.) More generally, I'm concerned (without having seen the
code) that we may have problems when passing skb's between devices via
zerocopy-netfilter when those devices have varying hard_header_len
requirements.
>
> I have no way to test out these changes, so can folks do that for me?
> If I didn't screw something else up, then I'm pretty sure the OOPS
> will go away. Let me know if something goes wrong due to these
> changes.
>
As far as I can tell it all seems fine. (I'd like to hear some
success stories from some of the people using netfilter heavily with
this though who have experienced the oops'es.)
Michal Ostrowski
[email protected]
After sleeping on it a bit I've come to the realization that are
still some issues outstanding.
Essentially, if __pppoe_xmit has been forced to make a copy of the skb
(because netfilter did not leave enough room for PPPoE headers), and
dev_queue_xmit fails, the copy of the skb is not freed and we have a
memory leak.
The generic PPP layer assumes that if a PPP-channel's xmit routine
fails then the skb is still available to it (for retransmission) and
otherwise the skb is gone -- handled by the channel. These are the
semantics that must be implemented by __pppoe_xmit.
The patch below (which requires David Miller's patch from 17/07/01)
implements these semantics.
Michal Ostrowski
[email protected]
--- drivers/net/pppoe.c~ Wed Jul 18 10:24:25 2001
+++ drivers/net/pppoe.c Thu Jul 19 08:28:56 2001
@@ -5,7 +5,7 @@
* PPPoE --- PPP over Ethernet (RFC 2516)
*
*
- * Version: 0.6.7
+ * Version: 0.6.8
*
* 030700 : Fixed connect logic to allow for disconnect.
* 270700 : Fixed potential SMP problems; we must protect against
@@ -25,8 +25,12 @@
* locked. (DaveM)
* Ignore return value of dev_queue_xmit in __pppoe_xmit
* or else we may kfree an SKB twice. (DaveM)
+ * 190701 : When doing copies of skb's in __pppoe_xmit, always delete
+ * the original skb that was passed in on success, never on
+ * failure. Delete the copy of the skb on failure to avoid
+ * a memory leak.
*
- * Author: Michal Ostrowski <[email protected]>
+ * Author: Michal Ostrowski <[email protected]>
* Contributors:
* Arnaldo Carvalho de Melo <[email protected]>
* David S. Miller ([email protected])
@@ -849,6 +853,7 @@
struct pppoe_hdr *ph;
int headroom = skb_headroom(skb);
int data_len = skb->len;
+ struct sk_buff *old_skb = NULL;
if (sk->dead || !(sk->state & PPPOX_CONNECTED))
goto abort;
@@ -876,17 +881,12 @@
skb_reserve(skb2, dev->hard_header_len + sizeof(struct pppoe_hdr));
memcpy(skb_put(skb2, skb->len), skb->data, skb->len);
- skb_unlink(skb);
- kfree_skb(skb);
+ /* Keep a reference to the original */
+ old_skb = skb;
+
skb = skb2;
}
- /* We may not return error beyond this point. Potentially this
- * is a new SKB and in such a case we've already freed the
- * original SKB. So if we were to return error, our caller would
- * double free that original SKB. -DaveM
- */
-
ph = (struct pppoe_hdr *) skb_push(skb, sizeof(struct pppoe_hdr));
memcpy(ph, &hdr, sizeof(struct pppoe_hdr));
skb->protocol = __constant_htons(ETH_P_PPP_SES);
@@ -899,7 +899,34 @@
sk->protinfo.pppox->pppoe_pa.remote,
NULL, data_len);
- dev_queue_xmit(skb);
+ /* The skb we are to transmit may be a copy (see above). If
+ * this fails, then the caller is responsible for the original
+ * skb, otherwise we must free it. Also if this fails we must
+ * free the copy that we made.
+ */
+
+ if (dev_queue_xmit(skb)<0) {
+ if (old_skb) {
+ /* The skb we tried to send was a copy. We
+ * have to free it (the copy) and let the
+ * caller deal with the original one.
+ */
+ skb_unlink(skb);
+ kfree_skb(skb);
+ }
+ goto abort;
+ }
+
+ /* Free original skb now that we know dev_queue_xmit
+ * succeeded. We free only "old_skb" because dev_queue_xmit
+ * actually sent a copy, not the original.
+ */
+
+ if (old_skb) {
+ skb_unlink(old_skb);
+ kfree_skb(old_skb);
+ }
+
return 1;
abort:
@@ -1049,7 +1076,6 @@
int err = register_pppox_proto(PX_PROTO_OE, &pppoe_proto);
if (err == 0) {
- printk(KERN_INFO "Registered PPPoE v0.6.5\n");
dev_add_pack(&pppoes_ptype);
dev_add_pack(&pppoed_ptype);
--- drivers/net/pppox.c~ Tue Feb 13 16:15:05 2001
+++ drivers/net/pppox.c Wed Jul 18 10:27:25 2001
@@ -148,10 +148,6 @@
err = sock_register(&pppox_proto_family);
- if (err == 0) {
- printk(KERN_INFO "Registered PPPoX v0.5\n");
- }
-
return err;
}
Hello!
SOme short comment on the patch:
> - dev_queue_xmit(skb);
> + /* The skb we are to transmit may be a copy (see above). If
> + * this fails, then the caller is responsible for the original
> + * skb, otherwise we must free it. Also if this fails we must
> + * free the copy that we made.
> + */
> +
> + if (dev_queue_xmit(skb)<0) {
dev_queue_xmit _frees_ frame, not depending on return value.
Return value is not a criterium to assume anything.
> + if (old_skb) {
> + /* The skb we tried to send was a copy. We
> + * have to free it (the copy) and let the
> + * caller deal with the original one.
> + */
> + skb_unlink(skb);
So, do you pass to dev_queue_xmit some skb, which is on some list?
Not a good idea. Please, clone it and submit clone, if you need to hold
original in some list.
Alexey
[email protected] writes:
> Hello!
>
> SOme short comment on the patch:
>
>
> > - dev_queue_xmit(skb);
> > + /* The skb we are to transmit may be a copy (see above). If
> > + * this fails, then the caller is responsible for the original
> > + * skb, otherwise we must free it. Also if this fails we must
> > + * free the copy that we made.
> > + */
> > +
> > + if (dev_queue_xmit(skb)<0) {
>
> dev_queue_xmit _frees_ frame, not depending on return value.
> Return value is not a criterium to assume anything.
>
My mistake. It seemed perfectly reasonable at 6:00 am. :-)
However, could we not have dev_queue_xmit behave as such (not free
frame on failure)? That is, could we extend dev_queue_xmit to tell it
(optionally) that we want the skb back in case of failure?
dev_queue_xmit unconditionally frees the skb in any failure mode,
which is I would venture to say that we could do this.
The reason why I'm proposing this is that ppp_generic.c assumes that
the skb is still available after a transmission failure via pppoe. To
support the semantics of dev_queue_xmit and ppp_generic we would have
to always copy skb's inside pppoe_xmit. Then, if dev_queue_xmit fails
the original is deleted.
In the common case dev_queue_xmit will not fail, and so in that case
I'd like to have to avoid making a copy of the skb.
Michal Ostrowski
[email protected]
Hello!
> However, could we not have dev_queue_xmit behave as such (not free
> frame on failure)?
If you need to hold original skb, you may hold its refcnt.
However, this feature inevitably results in big troubles: dev_queue_xmit()
is allowed to change skb and you cannot assume anything about this.
So that resuing skb dev_queue_xmit() is fatal bug not depending on anything.
> The reason why I'm proposing this is that ppp_generic.c assumes that
> the skb is still available after a transmission failure via pppoe. To
> support the semantics of dev_queue_xmit and ppp_generic we would have
> to always copy skb's inside pppoe_xmit. Then, if dev_queue_xmit fails
> the original is deleted.
You need not copy it. I said "clone".
Nobody is allowed to touch _data_ part of skb, so that you need not
to copy data.
Alexey
Alexey replied to my last post with some valuable comments and in
response I have a new patch (that goes on top of David Miller's patch
from yesterday).
The approach here is that in case we don't have room in the skb for
PPPoE headers, we create a new one (skb2) and copy the entire thing.
If we do have space, we clone it. We always give dev_queue_xmit the
copy/clone pointer and let it free it. We then kfree_skb the original
skb depending on the return value of dev_queue_xmit (to conform to the
expectations of ppp_generic).
Michal Ostrowski
[email protected]
--- drivers/net/pppoe.c~ Wed Jul 18 10:24:25 2001
+++ drivers/net/pppoe.c Thu Jul 19 14:49:36 2001
@@ -5,7 +5,7 @@
* PPPoE --- PPP over Ethernet (RFC 2516)
*
*
- * Version: 0.6.7
+ * Version: 0.6.8
*
* 030700 : Fixed connect logic to allow for disconnect.
* 270700 : Fixed potential SMP problems; we must protect against
@@ -25,8 +25,12 @@
* locked. (DaveM)
* Ignore return value of dev_queue_xmit in __pppoe_xmit
* or else we may kfree an SKB twice. (DaveM)
+ * 190701 : When doing copies of skb's in __pppoe_xmit, always delete
+ * the original skb that was passed in on success, never on
+ * failure. Delete the copy of the skb on failure to avoid
+ * a memory leak.
*
- * Author: Michal Ostrowski <[email protected]>
+ * Author: Michal Ostrowski <[email protected]>
* Contributors:
* Arnaldo Carvalho de Melo <[email protected]>
* David S. Miller ([email protected])
@@ -837,6 +841,7 @@
return error;
}
+
/************************************************************************
*
* xmit function for internal use.
@@ -849,6 +854,7 @@
struct pppoe_hdr *ph;
int headroom = skb_headroom(skb);
int data_len = skb->len;
+ struct sk_buff *skb2;
if (sk->dead || !(sk->state & PPPOX_CONNECTED))
goto abort;
@@ -864,7 +870,6 @@
/* Copy the skb if there is no space for the header. */
if (headroom < (sizeof(struct pppoe_hdr) + dev->hard_header_len)) {
- struct sk_buff *skb2;
skb2 = dev_alloc_skb(32+skb->len +
sizeof(struct pppoe_hdr) +
@@ -876,30 +881,36 @@
skb_reserve(skb2, dev->hard_header_len + sizeof(struct pppoe_hdr));
memcpy(skb_put(skb2, skb->len), skb->data, skb->len);
- skb_unlink(skb);
- kfree_skb(skb);
- skb = skb2;
+ } else {
+ /* Make a clone so as to not disturb the original skb,
+ * give dev_queue_xmit something it can free.
+ */
+ skb2 = skb_clone(skb, GFP_ATOMIC);
}
- /* We may not return error beyond this point. Potentially this
- * is a new SKB and in such a case we've already freed the
- * original SKB. So if we were to return error, our caller would
- * double free that original SKB. -DaveM
- */
-
- ph = (struct pppoe_hdr *) skb_push(skb, sizeof(struct pppoe_hdr));
+ ph = (struct pppoe_hdr *) skb_push(skb2, sizeof(struct pppoe_hdr));
memcpy(ph, &hdr, sizeof(struct pppoe_hdr));
- skb->protocol = __constant_htons(ETH_P_PPP_SES);
+ skb2->protocol = __constant_htons(ETH_P_PPP_SES);
- skb->nh.raw = skb->data;
+ skb2->nh.raw = skb2->data;
- skb->dev = dev;
+ skb2->dev = dev;
- dev->hard_header(skb, dev, ETH_P_PPP_SES,
+ dev->hard_header(skb2, dev, ETH_P_PPP_SES,
sk->protinfo.pppox->pppoe_pa.remote,
NULL, data_len);
- dev_queue_xmit(skb);
+ /* We're transmitting skb2, and assuming that dev_queue_xmit
+ * will free it. The generic ppp layer however, is expecting
+ * that we give back the skb in case of failure,
+ * but free it in case of success.
+ */
+
+ if (dev_queue_xmit(skb2)<0) {
+ goto abort;
+ }
+
+ kfree_skb(skb);
return 1;
abort:
@@ -1049,7 +1060,6 @@
int err = register_pppox_proto(PX_PROTO_OE, &pppoe_proto);
if (err == 0) {
- printk(KERN_INFO "Registered PPPoE v0.6.5\n");
dev_add_pack(&pppoes_ptype);
dev_add_pack(&pppoed_ptype);
--- drivers/net/pppox.c~ Tue Feb 13 16:15:05 2001
+++ drivers/net/pppox.c Wed Jul 18 10:27:25 2001
@@ -148,10 +148,6 @@
err = sock_register(&pppox_proto_family);
- if (err == 0) {
- printk(KERN_INFO "Registered PPPoX v0.5\n");
- }
-
return err;
}
Michal Ostrowski writes:
> Alexey replied to my last post with some valuable comments and in
> response I have a new patch (that goes on top of David Miller's patch
> from yesterday).
Applied to my tree, thanks.
Later,
David S. Miller
[email protected]
Hey, I'm having trouble applying your first patch to kernel 2.4.6. I have a
list of 18 failed hunks.
The command I used is: cd /usr/src/linux && patch -Np0 -i
/home/arch/pppoe-davidmiller.patch Is the patch for a different kernel? I
had the same problem applying it to 2.4.7-pre8.
Andrew Friedley
> Michal Ostrowski writes:
> > Alexey replied to my last post with some valuable comments and in
> > response I have a new patch (that goes on top of David Miller's patch
> > from yesterday).
>
> Applied to my tree, thanks.
>
> Later,
> David S. Miller
> [email protected]
On Tue, Jul 17, 2001 at 08:58:32PM -0700, David S. Miller wrote:
> Initially we believed it might be some obscure bug in netfilter
> which got triggered more easily when the zerocopy stuff went in.
> But all of our code audits turned up nothing.
>
> Then I began to notice that "pppoe" was showing up in all the reports
> where the user actually bothered to mention what net devices the
> machine was using when it crashed.
well, I have no PPPoE running and my OOPSen look very similar. (see MsgId
<[email protected]>).
I am using tulip, dummy, Ben Grear's dot1q VLAN devices and some ISDN
syncppp and ISDN rawip devices are configured (but not actively used),
too.
Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Rainer Clasen writes:
> I am using tulip, dummy, Ben Grear's dot1q VLAN devices and some ISDN
> syncppp and ISDN rawip devices are configured (but not actively used),
> too.
Can you test without dummy and VLAN? Man, I now have to audit that
friggin' code too :-(
Later,
David S. Miller
[email protected]
On Fri, Jul 20, 2001 at 12:28:35AM -0700, David S. Miller wrote:
>
> Rainer Clasen writes:
> > I am using tulip, dummy, Ben Grear's dot1q VLAN devices and some ISDN
> > syncppp and ISDN rawip devices are configured (but not actively used),
> > too.
>
> Can you test without dummy and VLAN? Man, I now have to audit that
> friggin' code too :-(
As first step I've removed dummy. Eliminating Vlan is difficult and will take
me some more time.
I could easily reproduce the oops with several nmap -sS through this router.
# ksymoops -K -L -o /lib/modules/2.4.6/ -m /boot/System.map-2.4.6-obs.1.1 < blurb
ksymoops 2.4.1 on i586 2.4.1. Options used
-V (default)
-K (specified)
-L (specified)
-o /lib/modules/2.4.6/ (specified)
-m /boot/System.map-2.4.6-obs.1.1 (specified)
No modules in ksyms, skipping objects
Unable to handle kernel paging request at virtual address 67720a25 printing eip:
c012612a
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c012612a>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 67720a0d ebx: 00000000 ecx: 67720a0d edx: 00000000
esi: c165d800 edi: c12d2680 ebp: 00000060 esp: c0209dd8
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0209000)
Stack: c0181e4d fffff800 c165d800 c0182443 c165d800 c165d800 c12f3000 c12c10a0
c12f3000 ffffffee c01853bd c165d800 00000020 c165d800 00000000 c12c10a0
c0188935 c165d800 c165d800 00000000 00000004 c01961cc c019625d c165d800
Call Trace: [<c0181e4d>] [<c0182443>] [<c01853bd>] [<c0188935>] [<c01961cc>] [<c019625d>] [<c018aa56>]
[<c01938b0>] [<c01961b2>] [<c01961cc>] [<c01938fa>] [<c018aa56>] [<c019385b>] [<c01938b0>] [<c0192c69>]
[<c0192aa8>] [<c018aa56>] [<c01928f6>] [<c0192aa8>] [<c0185a8d>] [<c0113aff>] [<c0107e5d>] [<c0105120>]
[<c0106b60>] [<c0105120>] [<c0105143>] [<c01051a7>] [<c0105000>]
Code: 8b 41 18 85 c0 7c 11 ff 49 14 0f 94 c0 84 c0 74 07 89 c8 e8
>>EIP; c012612a <__free_pages+2/1c> <=====
Trace; c0181e4d <skb_release_data+41/74>
Trace; c0182443 <skb_linearize+cf/130>
Trace; c01853bd <dev_queue_xmit+6d/244>
Trace; c0188935 <neigh_connected_output+95/c8>
Trace; c01961cc <ip_finish_output2+0/c8>
Trace; c019625d <ip_finish_output2+91/c8>
Trace; c018aa56 <nf_hook_slow+ee/144>
Trace; c01938b0 <ip_forward_finish+0/50>
Trace; c01961b2 <ip_finish_output+ee/f4>
Trace; c01961cc <ip_finish_output2+0/c8>
Trace; c01938fa <ip_forward_finish+4a/50>
Trace; c018aa56 <nf_hook_slow+ee/144>
Trace; c019385b <ip_forward+1eb/240>
Trace; c01938b0 <ip_forward_finish+0/50>
Trace; c0192c69 <ip_rcv_finish+1c1/1f8>
Trace; c0192aa8 <ip_rcv_finish+0/1f8>
Trace; c018aa56 <nf_hook_slow+ee/144>
Trace; c01928f6 <ip_rcv+376/3b0>
Trace; c0192aa8 <ip_rcv_finish+0/1f8>
Trace; c0185a8d <net_rx_action+135/258>
Trace; c0113aff <do_softirq+3f/68>
Trace; c0107e5d <do_IRQ+9d/b0>
Trace; c0105120 <default_idle+0/28>
Trace; c0106b60 <ret_from_intr+0/7>
Trace; c0105120 <default_idle+0/28>
Trace; c0105143 <default_idle+23/28>
Trace; c01051a7 <cpu_idle+3f/54>
Trace; c0105000 <_stext+0/0>
Code; c012612a <__free_pages+2/1c>
00000000 <_EIP>:
Code; c012612a <__free_pages+2/1c> <=====
0: 8b 41 18 mov 0x18(%ecx),%eax <=====
Code; c012612d <__free_pages+5/1c>
3: 85 c0 test %eax,%eax
Code; c012612f <__free_pages+7/1c>
5: 7c 11 jl 18 <_EIP+0x18> c0126142 <__free_pages+1a/1c>
Code; c0126131 <__free_pages+9/1c>
7: ff 49 14 decl 0x14(%ecx)
Code; c0126134 <__free_pages+c/1c>
a: 0f 94 c0 sete %al
Code; c0126137 <__free_pages+f/1c>
d: 84 c0 test %al,%al
Code; c0126139 <__free_pages+11/1c>
f: 74 07 je 18 <_EIP+0x18> c0126142 <__free_pages+1a/1c>
Code; c012613b <__free_pages+13/1c>
11: 89 c8 mov %ecx,%eax
Code; c012613d <__free_pages+15/1c>
13: e8 00 00 00 00 call 18 <_EIP+0x18> c0126142 <__free_pages+1a/1c>
Kernel panic: Aiee, killing interrupt handler!
Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
In message <005f01c10e69$28273e60$0200a8c0@loki> you write:
> My problem is, that I seem to be having "random" kernel panics. When I say
> random I mean that the system will run 30 minutes before it panics, or as
Looks like the pppoe problem DaveM found. Patch in progress...
Thanks,
Rusty.
--
Premature optmztion is rt of all evl. --DK