This patch fixes the problem where tcpdump shows duplicate packets
while tracing outbound packets on drivers which support lockless
transmit. The patch changes the current behaviour to tracing the
packets only on a successful transmit.
Signed-off-by: Ranjit Manomohan <[email protected]>
--- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
+++ linux/net/sched/sch_generic.c 2006-05-10 12:39:38.000000000 -0700
@@ -136,8 +136,12 @@
if (!netif_queue_stopped(dev)) {
int ret;
+ struct sk_buff *skbc = NULL;
+ /* Clone the skb so that we hold a reference
+ * to its data and we can trace it after a
+ * successful transmit. */
if (netdev_nit)
- dev_queue_xmit_nit(skb, dev);
+ skbc = skb_clone(skb, GFP_ATOMIC);
ret = dev->hard_start_xmit(skb, dev);
if (ret == NETDEV_TX_OK) {
@@ -145,6 +149,15 @@
dev->xmit_lock_owner = -1;
spin_unlock(&dev->xmit_lock);
}
+ if(skbc) {
+ /* transmit succeeded,
+ * trace the clone. */
+ dev_queue_xmit_nit(skbc,dev);
+ kfree_skb(skbc);
+ }
+ /* Free clone if it exists */
+ if(skbc)
+ kfree_skb(skbc);
spin_lock(&dev->queue_lock);
return -1;
}
Ranjit Manomohan <[email protected]> wrote:
>
> This patch fixes the problem where tcpdump shows duplicate packets
> while tracing outbound packets on drivers which support lockless
> transmit. The patch changes the current behaviour to tracing the
> packets only on a successful transmit.
>
There was no feedback on this one?
>
> --- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
> +++ linux/net/sched/sch_generic.c 2006-05-10 12:39:38.000000000 -0700
> @@ -136,8 +136,12 @@
>
> if (!netif_queue_stopped(dev)) {
> int ret;
> + struct sk_buff *skbc = NULL;
> + /* Clone the skb so that we hold a reference
> + * to its data and we can trace it after a
> + * successful transmit. */
Like this:
/*
* Clone the skb so that we hold a reference to
* its data and we can trace it after a
* successful transmit
*/
> if (netdev_nit)
> - dev_queue_xmit_nit(skb, dev);
> + skbc = skb_clone(skb, GFP_ATOMIC);
>
> ret = dev->hard_start_xmit(skb, dev);
> if (ret == NETDEV_TX_OK) {
> @@ -145,6 +149,15 @@
> dev->xmit_lock_owner = -1;
> spin_unlock(&dev->xmit_lock);
> }
> + if(skbc) {
Like this:
if (skbc)
> + /* transmit succeeded,
> + * trace the clone. */
> + dev_queue_xmit_nit(skbc,dev);
> + kfree_skb(skbc);
> + }
> + /* Free clone if it exists */
> + if(skbc)
if (skbc)
> + kfree_skb(skbc);
We don't need to test for skbc==NULL - kfree_skb(NULL) is legal.
This code will end up running kfree_skb(skbc) twice. Unless
dev_queue_xmit_nit() takes an additional ref on the skb (I don't think it
does), this will cause corruption of freed memory.
> spin_lock(&dev->queue_lock);
> return -1;
> }
dev_queue_xmit_nit() already clones the skb. It's a bit sad to be taking a
clone of a clone like this. Avoidable?
Thank you for the comments. Incorporated feedback into current version.
-Thanks,
Ranjit
--- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
+++ linux/net/sched/sch_generic.c 2006-05-14 08:53:33.000000000 -0700
@@ -136,8 +136,12 @@
if (!netif_queue_stopped(dev)) {
int ret;
+ struct sk_buff *skbc = NULL;
+ /* Clone the skb so that we hold a reference
+ * to its data and we can trace it after a
+ * successful transmit. */
if (netdev_nit)
- dev_queue_xmit_nit(skb, dev);
+ skbc = skb_clone(skb, GFP_ATOMIC);
ret = dev->hard_start_xmit(skb, dev);
if (ret == NETDEV_TX_OK) {
@@ -145,9 +149,17 @@
dev->xmit_lock_owner = -1;
spin_unlock(&dev->xmit_lock);
}
+ if (skbc) {
+ /* transmit succeeded,
+ * trace the clone. */
+ dev_queue_xmit_nit(skbc,dev);
+ kfree_skb(skbc);
+ }
spin_lock(&dev->queue_lock);
return -1;
}
+ /* Free clone if it exists */
+ kfree_skb(skbc);
if (ret == NETDEV_TX_LOCKED && nolock) {
spin_lock(&dev->queue_lock);
goto collision;
From: Andrew Morton <[email protected]>
Date: Sun, 14 May 2006 03:10:34 -0700
> It's a bit sad to be taking a clone of a clone like this.
> Avoidable?
Besides, clones of clones are illegal, if it's already a clone
you must make a copy.
On Sun, 14 May 2006, David S. Miller wrote:
> From: Andrew Morton <[email protected]>
> Date: Sun, 14 May 2006 03:10:34 -0700
>
> > It's a bit sad to be taking a clone of a clone like this.
> > Avoidable?
>
> Besides, clones of clones are illegal, if it's already a clone
> you must make a copy.
>
Heres a new version which does a copy instead of the clone to avoid
the double cloning issue.
-Thanks,
Ranjit
--- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
+++ linux/net/sched/sch_generic.c 2006-05-15 13:49:09.000000000 -0700
@@ -136,8 +136,11 @@
if (!netif_queue_stopped(dev)) {
int ret;
+ struct sk_buff *skbc = NULL;
+ /* Copy the skb so that we can trace it after
+ * a successful transmit. */
if (netdev_nit)
- dev_queue_xmit_nit(skb, dev);
+ skbc = skb_copy(skb, GFP_ATOMIC);
ret = dev->hard_start_xmit(skb, dev);
if (ret == NETDEV_TX_OK) {
@@ -145,9 +148,20 @@
dev->xmit_lock_owner = -1;
spin_unlock(&dev->xmit_lock);
}
+ if (skbc) {
+ /* transmit succeeded,
+ * trace the copy. */
+ dev_queue_xmit_nit(skbc,dev);
+ kfree_skb(skbc);
+ }
spin_lock(&dev->queue_lock);
return -1;
}
+
+ /* Free copy if it exists */
+ if (skbc)
+ kfree_skb(skbc);
+
if (ret == NETDEV_TX_LOCKED && nolock) {
spin_lock(&dev->queue_lock);
goto collision;
From: Ranjit Manomohan <[email protected]>
Date: Mon, 15 May 2006 14:19:06 -0700 (PDT)
> Heres a new version which does a copy instead of the clone to avoid
> the double cloning issue.
I still very much dislike this patch because it is creating
1 more clone per packet than is actually necessary and that
is very expensive.
dev_queue_xmit_nit() is going to clone whatever SKB you send into
there, so better to just bump the reference count (with skb_get())
instead of cloning or copying.
David S. Miller wrote:
> From: Ranjit Manomohan <[email protected]>
> Date: Mon, 15 May 2006 14:19:06 -0700 (PDT)
>
>
>>Heres a new version which does a copy instead of the clone to avoid
>>the double cloning issue.
>
>
> I still very much dislike this patch because it is creating
> 1 more clone per packet than is actually necessary and that
> is very expensive.
>
> dev_queue_xmit_nit() is going to clone whatever SKB you send into
> there, so better to just bump the reference count (with skb_get())
> instead of cloning or copying.
I think this would break the tc actions. Some actions call
pskb_expand_head() on input, which BUGs on skb_shared(skb).
They can't clone the skb instead because the functions
doing that don't own it, the caller would continue with the
old skb.
On Mon, 15 May 2006, David S. Miller wrote:
> From: Ranjit Manomohan <[email protected]>
> Date: Mon, 15 May 2006 14:19:06 -0700 (PDT)
>
> > Heres a new version which does a copy instead of the clone to avoid
> > the double cloning issue.
>
> I still very much dislike this patch because it is creating
> 1 more clone per packet than is actually necessary and that
> is very expensive.
>
> dev_queue_xmit_nit() is going to clone whatever SKB you send into
> there, so better to just bump the reference count (with skb_get())
> instead of cloning or copying.
>
I was a bit apprehensive about just incrementing the refcnt but that works
too. Attached is the modified version.
-Thanks,
Ranjit
--- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
+++ linux/net/sched/sch_generic.c 2006-05-15 15:48:03.000000000 -0700
@@ -136,8 +136,12 @@
if (!netif_queue_stopped(dev)) {
int ret;
+ struct sk_buff *skbc = NULL;
+ /* Increment the reference count on the skb so
+ * that we can use it after a successful xmit.
+ */
if (netdev_nit)
- dev_queue_xmit_nit(skb, dev);
+ skbc = skb_get(skb);
ret = dev->hard_start_xmit(skb, dev);
if (ret == NETDEV_TX_OK) {
@@ -145,9 +149,20 @@
dev->xmit_lock_owner = -1;
spin_unlock(&dev->xmit_lock);
}
+ if (skbc) {
+ /* transmit succeeded,
+ * trace the buffer. */
+ dev_queue_xmit_nit(skbc,dev);
+ kfree_skb(skbc);
+ }
spin_lock(&dev->queue_lock);
return -1;
}
+
+ /* Call free in case we incremented refcnt */
+ if (skbc)
+ kfree_skb(skbc);
+
if (ret == NETDEV_TX_LOCKED && nolock) {
spin_lock(&dev->queue_lock);
goto collision;
On Mon, 15 May 2006 16:11:05 -0700 (PDT)
Ranjit Manomohan <[email protected]> wrote:
> On Mon, 15 May 2006, David S. Miller wrote:
>
> > From: Ranjit Manomohan <[email protected]>
> > Date: Mon, 15 May 2006 14:19:06 -0700 (PDT)
> >
> > > Heres a new version which does a copy instead of the clone to avoid
> > > the double cloning issue.
> >
> > I still very much dislike this patch because it is creating
> > 1 more clone per packet than is actually necessary and that
> > is very expensive.
> >
> > dev_queue_xmit_nit() is going to clone whatever SKB you send into
> > there, so better to just bump the reference count (with skb_get())
> > instead of cloning or copying.
> >
>
> I was a bit apprehensive about just incrementing the refcnt but that works
> too. Attached is the modified version.
>
> -Thanks,
> Ranjit
>
> --- linux-2.6/net/sched/sch_generic.c 2006-05-10 12:34:52.000000000 -0700
> +++ linux/net/sched/sch_generic.c 2006-05-15 15:48:03.000000000 -0700
> @@ -136,8 +136,12 @@
>
> if (!netif_queue_stopped(dev)) {
> int ret;
> + struct sk_buff *skbc = NULL;
> + /* Increment the reference count on the skb so
> + * that we can use it after a successful xmit.
> + */
> if (netdev_nit)
> - dev_queue_xmit_nit(skb, dev);
> + skbc = skb_get(skb);
skbc = netdev_nit ? skb_get(skb) : NULL;
>
> ret = dev->hard_start_xmit(skb, dev);
> if (ret == NETDEV_TX_OK) {
> @@ -145,9 +149,20 @@
> dev->xmit_lock_owner = -1;
> spin_unlock(&dev->xmit_lock);
> }
> + if (skbc) {
> + /* transmit succeeded,
> + * trace the buffer. */
> + dev_queue_xmit_nit(skbc,dev);
> + kfree_skb(skbc);
> + }
> spin_lock(&dev->queue_lock);
> return -1;
> }
> +
> + /* Call free in case we incremented refcnt */
> + if (skbc)
> + kfree_skb(skbc);
kfree_skb(NULL) is legal so the conditional here is unneeded.
But the increased calls to kfree_skb(NULL) would probably bring the
"unlikely()" hordes descending on kfree_skb, so maybe:
if (unlikely(netdev_nit))
kfree_skb(skbc);
From: Stephen Hemminger <[email protected]>
Date: Mon, 15 May 2006 16:41:01 -0700
> kfree_skb(NULL) is legal so the conditional here is unneeded.
>
> But the increased calls to kfree_skb(NULL) would probably bring the
> "unlikely()" hordes descending on kfree_skb, so maybe:
And unfortunately as Patrick McHardy states, we can't use
this trick here because things like tc actions can do stuff
like pskb_expand_head() which cannot handle shared SKBs.
We need another solution to this problem, because cloning an
extra SKB is just rediculious overhead so isn't something we
can seriously consider to solve this problem.
Another option is to say this anomaly doesn't matter enough
to justify the complexity we're looking at here just to fix
this glitch.
Other implementation possibility suggestions welcome :-)
David S. Miller wrote:
> From: Stephen Hemminger <[email protected]>
> Date: Mon, 15 May 2006 16:41:01 -0700
>
>
>>kfree_skb(NULL) is legal so the conditional here is unneeded.
>>
>>But the increased calls to kfree_skb(NULL) would probably bring the
>>"unlikely()" hordes descending on kfree_skb, so maybe:
>
>
> And unfortunately as Patrick McHardy states, we can't use
> this trick here because things like tc actions can do stuff
> like pskb_expand_head() which cannot handle shared SKBs.
>
> We need another solution to this problem, because cloning an
> extra SKB is just rediculious overhead so isn't something we
> can seriously consider to solve this problem.
>
> Another option is to say this anomaly doesn't matter enough
> to justify the complexity we're looking at here just to fix
> this glitch.
>
> Other implementation possibility suggestions welcome :-)
We could just mark the skb to make sure its only passed to taps
on the first transmission attempt. Since we have the timestamp
optimization there shouldn't be any visible change besides
the desired effect.
David S. Miller <[email protected]> wrote:
>
> Other implementation possibility suggestions welcome :-)
I see two possibilities:
1) Move the af_packet hook into the NIC driver.
2) Rethink the lockless tx setup. If all NICs followed the tg3 and
replaced spin_lock_irqsave with spin_lock then we should be able
to go back to just using the xmit_lock.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, May 16, 2006 at 02:21:27AM +0200, Patrick McHardy wrote:
> David S. Miller wrote:
> > From: Stephen Hemminger <[email protected]>
> > Date: Mon, 15 May 2006 16:41:01 -0700
> >
> >
> >>kfree_skb(NULL) is legal so the conditional here is unneeded.
> >>
> >>But the increased calls to kfree_skb(NULL) would probably bring the
> >>"unlikely()" hordes descending on kfree_skb, so maybe:
> >
> >
> > And unfortunately as Patrick McHardy states, we can't use
> > this trick here because things like tc actions can do stuff
> > like pskb_expand_head() which cannot handle shared SKBs.
> >
> > We need another solution to this problem, because cloning an
> > extra SKB is just rediculious overhead so isn't something we
> > can seriously consider to solve this problem.
> >
> > Another option is to say this anomaly doesn't matter enough
> > to justify the complexity we're looking at here just to fix
> > this glitch.
> >
> > Other implementation possibility suggestions welcome :-)
>
>
> We could just mark the skb to make sure its only passed to taps
> on the first transmission attempt. Since we have the timestamp
> optimization there shouldn't be any visible change besides
> the desired effect.
It would be nice to have a solution that taps after its been sent to the
driver. I clearly need to investigate the tc problems, but I've been using a
similar patch now for a few weeks to allow more accurate timestamps to be
performed in the driver without cloning problems getting in the way. (the tap
skb only gets cloned after the driver has time stamped it). This is more
accuracy than most need but is required for the clock synchronisation i'm
working on. It also seems less intrusive to let the driver have the skb first
before pushing it to the taps.
Tom
--
Thomas Young
http://cubinlab.ee.unimelb.edu.au/~tyo/
Research Assistant
CUBIN Research Centre - University of Melbourne
Herbert Xu wrote:
> David S. Miller <[email protected]> wrote:
>
>>Other implementation possibility suggestions welcome :-)
>
>
> I see two possibilities:
>
> 1) Move the af_packet hook into the NIC driver.
> 2) Rethink the lockless tx setup. If all NICs followed the tg3 and
> replaced spin_lock_irqsave with spin_lock then we should be able
> to go back to just using the xmit_lock.
3) Clone the skb and have dev_queue_xmit_nit() consume it.
That should actually be pretty easy.
On Tue, May 16, 2006 at 03:17:59AM +0200, Patrick McHardy wrote:
>
> 3) Clone the skb and have dev_queue_xmit_nit() consume it.
>
> That should actually be pretty easy.
Unfortunately that would mean an unconditional copy for all TSO packets
on NICs such as tg3/e1000. These drivers have to modify the data area
of the skb.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Patrick McHardy wrote:
> 3) Clone the skb and have dev_queue_xmit_nit() consume it.
>
> That should actually be pretty easy.
On second thought, thats not so great either. netdev_nit
just globally signals that there are some taps, but we
don't know if they're interested in a specific packet.
From: Patrick McHardy <[email protected]>
Date: Tue, 16 May 2006 03:22:21 +0200
> Patrick McHardy wrote:
> > 3) Clone the skb and have dev_queue_xmit_nit() consume it.
> >
> > That should actually be pretty easy.
>
> On second thought, thats not so great either. netdev_nit
> just globally signals that there are some taps, but we
> don't know if they're interested in a specific packet.
Yes, and all of these issues coming up are why we do the
dev_queue_xmit_nit() before not after we try to give it to the device.
Even the "NIT done" bit doesn't work, for cases like ICMP replies
which can reuse the SKB received, over loopback. Sure we can try to
forcefully clear the bit everywhere we reuse buffers for replies like
that, but I wish anyone trying to implement that the best of luck
finding all spots successfully.
To be honest, I don't think this bug is worth all the energy we are
trying to put into fixing it.
We get a double packet when the spinlock is hit, big deal.