2006-09-16 12:08:48

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Mon, Jun 19, 2006 at 05:24:31PM +0200, Andi Kleen wrote:
>
> > If you use "pmtmr" try to reboot with kernel option "clock=tsc".
>
> That's dangerous advice - when the system choses not to use
> TSC it often has a reason.

I just found out that TSC clocksource is not implemented on x86-64.
Kernel version 2.6.18-rc7, is it true?

I've also had experience of unsynchronized TSCs on a dual-core Athlon,
but it was cured by idle=poll.

>
> >
> > On my Opteron AMD system i normally can route 400 kpps, but with
> > timesource "pmtmr" i could only route around 83 kpps. (I found the timer
> > to be the issue by using oprofile).
>
> Unless you're using packet sniffing or any other application
> that requests time stamps on a socket then the timer shouldn't
> make much difference. Incoming packets are only time stamped
> when someone asks for the timestamps.
>
It seems that dhcpd3 makes the box timestamp incoming packets,
killing the performance. I think that combining a router and a DHCP server
on the same box is a legitimate situation, isn't it?

~
:wq
With best regards,
Vladimir Savkin.


2006-09-18 08:36:07

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

"Vladimir B. Savkin" <[email protected]> writes:

> On Mon, Jun 19, 2006 at 05:24:31PM +0200, Andi Kleen wrote:
> >
> > > If you use "pmtmr" try to reboot with kernel option "clock=tsc".
> >
> > That's dangerous advice - when the system choses not to use
> > TSC it often has a reason.
>
> I just found out that TSC clocksource is not implemented on x86-64.
> Kernel version 2.6.18-rc7, is it true?

The x86-64 timer subsystem currently doesn't have clocksources
at all, but it supports TSC and some other timers.

>
> I've also had experience of unsychronized TSC on dual-core Athlon,
> but it was cured by idle=poll.

You can use that, but it will make your system run quite hot
and cost you a lot of powe^wmoney.

> It seems that dhcpd3 makes the box timestamping incoming packets,
> killing the performance. I think that combining router and DHCP server
> on a same box is a legitimate situation, isn't it?


Yes. Good point. DHCP is broken and needs to be fixed. Can you
send a bug report to the DHCP maintainers?

iirc the problem used to be that RAW sockets didn't do something
they need them to do. Maybe we can fix that now.

If that's not possible we can probably add a ioctl or similar
to disable time stamping for packet sockets (DHCP shouldn't really
need a fine grained time stamp). dhcpcd would need to use that then.

Keep me updated what they say.

-Andi

2006-09-18 09:03:33

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Mon, Sep 18, 2006 at 10:35:38AM +0200, Andi Kleen wrote:
> > I just found out that TSC clocksource is not implemented on x86-64.
> > Kernel version 2.6.18-rc7, is it true?
>
> The x86-64 timer subsystems currently doesn't have clocksources
> at all, but it supports TSC and some other timers.

Hm. On my box, TSC did not work until I hacked arch/i386/kernel/tsc.c
into it.
Neither clock=tsc nor clocksource=tsc had any effect.

> > I've also had experience of unsychronized TSC on dual-core Athlon,
> > but it was cured by idle=poll.
>
> You can use that, but it will make your system run quite hot
> and cost you a lot of powe^wmoney.

Here in Russia electric power is cheap compared with hardware upgrade.

> > It seems that dhcpd3 makes the box timestamping incoming packets,
> > killing the performance. I think that combining router and DHCP server
> > on a same box is a legitimate situation, isn't it?
>
> Yes. Good point. DHCP is broken and needs to be fixed. Can you
> send a bug report to the DHCP maintainers?
>
> iirc the problem used to be that RAW sockets didn't do something
> they need them to do. Maybe we can fix that now.

Will try some days later.

Oh, and pppoe-server uses some kind of packet socket too, doesn't it?

>
> If that's not possible we can probably add a ioctl or similar
> to disable time stamping for packet sockets (DHCP shouldn't really
> need a fine grained time stamp). dhcpcd would need to use that then.

I would like some sysctl very much, too. Let tcpdump show imprecise
timestamps when forwarding performance is more important.
After all, Ciscos don't have any tcpdump analog at all, and they are
very popular :)

>
> Keep me updated what they say.
>
> -Andi
>
~
:wq
With best regards,
Vladimir Savkin.

2006-09-18 09:58:31

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

"Vladimir B. Savkin" <[email protected]> writes:

> On Mon, Sep 18, 2006 at 10:35:38AM +0200, Andi Kleen wrote:
> > > I just found out that TSC clocksource is not implemented on x86-64.
> > > Kernel version 2.6.18-rc7, is it true?
> >
> > The x86-64 timer subsystems currently doesn't have clocksources
> > at all, but it supports TSC and some other timers.
>

> until I hacked arch/i386/kernel/tsc.c

Then you don't use x86-64.


>
> > > I've also had experience of unsychronized TSC on dual-core Athlon,
> > > but it was cured by idle=poll.
> >
> > You can use that, but it will make your system run quite hot
> > and cost you a lot of powe^wmoney.
>
> Here in Russia electric power is cheap compared with hardware upgrade.

It's not just electrical power - the hardware is more stressed and will
likely fail earlier too. As a rule of thumb the hotter your hardware runs
the earlier it will fail.

>
> > > It seems that dhcpd3 makes the box timestamping incoming packets,
> > > killing the performance. I think that combining router and DHCP server
> > > on a same box is a legitimate situation, isn't it?
> >
> > Yes. Good point. DHCP is broken and needs to be fixed. Can you
> > send a bug report to the DHCP maintainers?
> >
> > iirc the problem used to be that RAW sockets didn't do something
> > they need them to do. Maybe we can fix that now.
>
> Will try some days later.
>
> Oh, and pppoe-server uses some kind of packet socket too, doesn't it?

The problem is not really using a packet socket, but using the SIOCGSTAMP
ioctl on it. As soon as someone issues it the system will take accurate
time stamps for each incoming packet until the respective socket is closed.

Quick fix is to change user space to use gettimeofday() when it reads
the packet instead.

For netdev: I'm more and more thinking we should just avoid the problem
completely and switch to "true end2end" timestamps. This means don't
time stamp when a packet is received, but only when it is delivered
to a socket. The timestamp at receiving is a lie anyway because
the network hardware can add an arbitrarily long delay before the driver interrupt
handler runs. Then the problem above would completely disappear.
Comments? Opinions?

-Andi
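
The quick fix described above (stamp the packet in user space when it is
read, rather than asking the kernel via SIOCGSTAMP) could look roughly like
the sketch below. The helper name recv_with_user_stamp() is hypothetical and
is not taken from dhcpd or any other program mentioned in this thread; it
only assumes the standard recv() and gettimeofday() calls.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Read one datagram and stamp it in user space once the read returns.
 * No SIOCGSTAMP ioctl is issued, so this reader never asks the kernel
 * to timestamp incoming packets on its behalf.
 * recv_with_user_stamp() is a hypothetical helper for illustration only.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t recv_with_user_stamp(int fd, void *buf, size_t len,
                             struct timeval *stamp)
{
    ssize_t n = recv(fd, buf, len, 0);

    if (n < 0)
        return -1;              /* errno is left as set by recv() */
    if (gettimeofday(stamp, NULL) < 0)
        return -1;
    return n;
}
```

The stamp taken this way includes any time the packet spent queued on the
socket, which is acceptable for a consumer that does not need fine-grained
timing but not for one that wants timestamps close to the wire.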

2006-09-18 10:29:22

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Mon, Sep 18, 2006 at 11:58:21AM +0200, Andi Kleen wrote:
> > > The x86-64 timer subsystems currently doesn't have clocksources
> > > at all, but it supports TSC and some other timers.
> >
>
> > until I hacked arch/i386/kernel/tsc.c
>
> Then you don't use x86-64.
>
Oh. I mean I made arch/i386/kernel/tsc.c compile on x86-64
by hacking some Makefiles and headers.

But the question is, why stock 2.6.18-rc7 could not use TSC on its own?

> > > > I've also had experience of unsychronized TSC on dual-core Athlon,
> > > > but it was cured by idle=poll.
> > >
> > > You can use that, but it will make your system run quite hot
> > > and cost you a lot of powe^wmoney.
> >
> > Here in Russia electric power is cheap compared with hardware upgrade.
>
> It's not just electrical power - the hardware is more stressed and will
> likely fail earlier too. As a rule of thumb the hotter your hardware runs
> the earlier it will fail.

What hardware exactly? Doesn't it affect only the CPU? And CPUs are not
known to fail before any other components.
> >
> > > > It seems that dhcpd3 makes the box timestamping incoming packets,
> > > > killing the performance. I think that combining router and DHCP server
> > > > on a same box is a legitimate situation, isn't it?
> > >
> > > Yes. Good point. DHCP is broken and needs to be fixed. Can you
> > > send a bug report to the DHCP maintainers?
> > >
> > > iirc the problem used to be that RAW sockets didn't do something
> > > they need them to do. Maybe we can fix that now.
> >
> > Will try some days later.
> >
> > Oh, and pppoe-server uses some kind of packet socket too, doesn't it?
>
> The problem is not really using a packet socket, but using the SIOCGSTAMP
> ioctl on it. As soon as someone issues it the system will take accurate
> time stamps for each incoming packet until the respective socket is closed.
>
> Quick fix is to change user space to use gettimeofday() when it reads
> the packet instead.

Ok, thank you, I now understand.

>
> For netdev: I'm more and more thinking we should just avoid the problem
> completely and switch to "true end2end" timestamps. This means don't
> time stamp when a packet is received, but only when it is delivered
> to a socket. The timestamp at receiving is a lie anyways because
> the network hardware can add an arbitary long delay before the driver interrupt
> handler runs. Then the problem above would completely disappear.
> Comments? Opinions?
>
> -Andi
>
~
:wq
With best regards,
Vladimir Savkin.

2006-09-18 11:28:06

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

"Vladimir B. Savkin" <[email protected]> writes:

[you seem to send your emails in a strange way that doesn't keep me in cc.
Please stop doing that.]

> On Mon, Sep 18, 2006 at 11:58:21AM +0200, Andi Kleen wrote:
> > > > The x86-64 timer subsystems currently doesn't have clocksources
> > > > at all, but it supports TSC and some other timers.
> > >
> >
> > > until I hacked arch/i386/kernel/tsc.c
> >
> > Then you don't use x86-64.
> >
> Oh. I mean I made arch/i386/kernel/tsc.c compile on x86-64
> by hacking some Makefiles and headers.

The codebase for timing (and lots of other things) is quite different
between 32bit and 64bit. You're really surprised it doesn't work if you do such things?

> But the question is, why stock 2.6.18-rc7 could not use TSC on its own?

x86-64 doesn't use the TSC when it deems it to not be reliable, which
is the case on your system.

> > > > > I've also had experience of unsychronized TSC on dual-core Athlon,
> > > > > but it was cured by idle=poll.
> > > >
> > > > You can use that, but it will make your system run quite hot
> > > > and cost you a lot of powe^wmoney.
> > >
> > > Here in Russia electric power is cheap compared with hardware upgrade.
> >
> > It's not just electrical power - the hardware is more stressed and will
> > likely fail earlier too. As a rule of thumb the hotter your hardware runs
> > the earlier it will fail.
>
> What hardware exactly. Doesn't it affect only CPU? And they are not
> know to fail before any other components.

All hardware. It's basic physics.

-Andi

2006-09-18 14:09:13

by David Miller

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

From: Andi Kleen <[email protected]>
Date: 18 Sep 2006 11:58:21 +0200

> For netdev: I'm more and more thinking we should just avoid the
> problem completely and switch to "true end2end" timestamps. This
> means don't time stamp when a packet is received, but only when it
> is delivered to a socket. The timestamp at receiving is a lie
> anyways because the network hardware can add an arbitary long delay
> before the driver interrupt handler runs. Then the problem above
> would completely disappear.

I don't think this is wise.

People who run tcpdump want "wire" timestamps as close as possible.
Yes, things get delayed with the IRQ path, DMA delays, IRQ
mitigation and whatnot, but it's an order of magnitude worse if
you delay to user read() since that introduces also the delay of
the packet copies to userspace which are significantly larger than
these hardware level delays. If tcpdump gets swapped out, the
timestamp delay can be on the order of several seconds making it
totally useless.

Andi, you will need to find another solution to this problem :-)

2006-09-18 14:30:00

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20


>
> People who run tcpdump want "wire" timestamps as close as possible.
> Yes, things get delayed with the IRQ path, DMA delays, IRQ
> mitigation and whatnot, but it's an order of magnitude worse if
> you delay to user read() since that introduces also the delay of
> the packet copies to userspace which are significantly larger than
> these hardware level delays. If tcpdump gets swapped out, the
> timestamp delay can be on the order of several seconds making it
> totally useless.

My proposal wasn't to delay to user read, just to do the time stamp in socket
context. This means as soon as packet or RAW/UDP have looked up the socket and can
check a per socket flag do the time stamp.

The only delay this would add would be the queueing time from the NIC
to the softirq. Do you really think that is that bad?

-Andi

2006-09-18 14:57:01

by Alan

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Ar Llu, 2006-09-18 am 16:29 +0200, ysgrifennodd Andi Kleen:
> The only delay this would add would be the queueing time from the NIC
> to the softirq. Do you really think that is that bad?

If you are trying to do things like network record/playback then you
want the minimal delay. There's a reason the original timestamp code
supported the hardware setting the timestamp itself - we actually had a
separate set of logic on a board that was doing the timestamping by
watching the IRQ line of the NIC chip.

Alan

2006-09-18 15:19:41

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Monday 18 September 2006 17:19, Alan Cox wrote:
> Ar Llu, 2006-09-18 am 16:29 +0200, ysgrifennodd Andi Kleen:
> > The only delay this would add would be the queueing time from the NIC
> > to the softirq. Do you really think that is that bad?
>
> If you are trying to do things like network record/playback then you
> want the minimal delay.

But it's not minimal. Maybe it was long ago when the code was designed
on a 3c509 but not with modern hardware: Think interrupt mitigation and NAPI.

And with NAPI we tend to process the packets directly after they
are fetched out of the RX queue, so there is practically no delay
between driver seeing the packet and softirq seeing it. All the queuing
is done either at hardware level or later at socket level.

> There's a reason the original timestamp code
> supported the hardware setting the timestamp itself - we actually had a
> separare set of logic on a board that was doing the timestamping by
> watching the IRQ line of the NIC chip.

That would be fine too (because it will be likely fast), but unfortunately
I don't know of any driver that does that.

-Andi

2006-09-18 15:39:33

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> For netdev: I'm more and more thinking we should just avoid the problem
> completely and switch to "true end2end" timestamps. This means don't
> time stamp when a packet is received, but only when it is delivered
> to a socket.

This will work.

From the viewpoint of existing uses of the timestamp by packet sockets
this time is not worse. The only danger is violation of causality
(when a forwarded packet or reply packet gets a timestamp earlier than
the original packet). This pathology was the main reason why the timestamp
is recorded early, before the packet is demultiplexed in netif_receive_skb().
But it is not a practical problem: delivery to packet/raw sockets
is occasionally placed _before_ delivery to real protocol handlers.


> handler runs. Then the problem above would completely disappear.

Well, not completely. A too-slow clock source remains a too-slow clock source.
If it is so slow that it results in "performance degradation", it just
should not be used at all; even such a pariah as tcpdump wants to be fast.

Actually, I have a question. Why the subject is
"Network performance degradation from 2.6.11.12 to 2.6.16.20"?
I do not see beginning of the thread and cannot guess
why clock source degraded. :-)

Alexey

2006-09-18 15:54:51

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Monday 18 September 2006 17:38, Alexey Kuznetsov wrote:
> Hello!
>
> > For netdev: I'm more and more thinking we should just avoid the problem
> > completely and switch to "true end2end" timestamps. This means don't
> > time stamp when a packet is received, but only when it is delivered
> > to a socket.
>
> This will work.
>
> From viewpoint of existing uses of timestamp by packet socket
> this time is not worse. The only danger is violation of casuality
> (when forwarded packet or reply packet gets timestamp earlier than
> original packet).

Hmm, not sure how that could happen. Also is it a real problem
even if it could?

> > handler runs. Then the problem above would completely disappear.
>
> Well, not completely. Too slow clock source remains too slow clock source.
> If it is so slow, that it results in "performance degradation", it just
> should not be used at all, even such pariah as tcpdump wants to be fast.
>
> Actually, I have a question. Why the subject is
> "Network performance degradation from 2.6.11.12 to 2.6.16.20"?
> I do not see beginning of the thread and cannot guess
> why clock source degraded. :-)

It's a long and sad story.

Old kernels didn't disable the TSC on those boxes (multi core K8) and assumed
they were synchronized for timing purposes.

This initially mostly worked if you don't use cpufreq,
but over a longer uptime the TSCs would drift against each other and timing
would jump more and more between CPUs.

On older versions of K8 this drift happened much more slowly (more
aggressive power saving in HLT in newer steppings made it worse; that is why
idle=poll helps) and could often be ignored. But technically it was still a
bug there because it could break timing after long uptimes.

New multi socket K8 boxes are generally
totally unusable with TSC because they use cpufreq and the TSCs can run
at completely different frequencies, which obviously doesn't give very
good timing information if you assume the TSC is globally synchronized.

That is why later kernels default to TSC off. The original plan
was to use HPET then, which is slower than TSC, but still not that bad.
But while most modern systems have a HPET timer somewhere in the chipset
nearly all BIOS vendors "forgot" to describe it in the BIOS because Windows
didn't use it and Linux can't find it because of that.

Then it has to use the ACPI pmtmr which is really really slow.
The overhead of that thing is so large that you can clearly see it in
the network benchmark.

The real fix long term is to change the timer subsystem to keep all TSC
state per CPU, then it'll work on the K8s too. Unfortunately it's a moderately
hard problem to make the result still fully monotonic. But people are working
on it.

-Andi
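
To see how expensive the active time source is on a given machine, a rough
micro-benchmark of gettimeofday() can help. The sketch below is only an
illustration: both function names are hypothetical, and the sysfs path is an
assumption about the running kernel (newer kernels expose the active
clocksource there; it may be absent on older or unusual systems). On kernels
that serve gettimeofday() from user space without a system call, the measured
cost will be much lower than the in-kernel cost discussed in this thread.

```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/*
 * Average wall-clock cost of one gettimeofday() call, in nanoseconds.
 * Returns -1.0 if calls is zero. The result is only a rough estimate:
 * it is itself measured with gettimeofday() and includes loop overhead.
 */
double gettimeofday_cost_ns(unsigned long calls)
{
    struct timeval start, end, scratch;
    unsigned long i;
    double usec;

    if (calls == 0)
        return -1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < calls; i++)
        gettimeofday(&scratch, NULL);
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1e6
         + (end.tv_usec - start.tv_usec);
    return usec * 1000.0 / (double)calls;
}

/*
 * Copy the name of the active clocksource (e.g. "tsc", "hpet",
 * "acpi_pm") into buf. Returns 0 on success, -1 if the sysfs file
 * is not available. The path is an assumption, see the note above.
 */
int current_clocksource(char *buf, size_t len)
{
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
                    "current_clocksource", "r");

    if (f == NULL)
        return -1;
    if (len == 0 || fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

Calling gettimeofday_cost_ns(1000000) before and after switching the time
source gives a quick way to compare them on the same box.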

2006-09-18 16:29:01

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> Hmm, not sure how that could happen. Also is it a real problem
> even if it could?

As I said, the problem is _occasionally_ theoretical.

This would happen e.g. if the packet socket handler was installed
after the IP handler. Then tcpdump would get the packet after it is processed
(acked/replied/forwarded). This would be disastrous, the results
are unparsable.

I recall, the issue was discussed, and that time it looked more
reasonable to solve problems of this kind taking timestamp once
before it is seen by all the rest of stack. Who could expect that
PIT nightmare is going to return? :-)


> Then it has to use the ACPI pmtmr which is really really slow.
> The overhead of that thing is so large that you can clearly see it in
> the network benchmark.

I see. Thank you.

Alexey

2006-09-18 16:50:31

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Monday 18 September 2006 18:28, Alexey Kuznetsov wrote:
> Hello!
>
> > Hmm, not sure how that could happen. Also is it a real problem
> > even if it could?
>
> As I said, the problem is _occasionally_ theoretical.
>
> This would happen f.e. if packet socket handler was installed
> after IP handler. Then tcpdump would get packet after it is processed
> (acked/replied/forwarded). This would be disasterous, the results
> are unparsable.

But that never happens, right?

And do you have some other preferred way to solve this? Even if the timer
was fast it would still be good to avoid it in the fast path when DHCPD
is running.

I suppose in the worst case a sysctl like Vladimir asked for could be added,
but it would seem somewhat lame.

-Andi

2006-09-18 21:03:33

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> But that never happens right?

Right.

Well, not right. It happens. Simply because you get packet
with newer timestamp after previous handler saw this packet
and did some actions. I just do not see any bad consequences.


> And do you have some other prefered way to solve this? Even if the timer
> was fast it would be still good to avoid it in the fast path when DHCPD
> is running.

No. The way, which you suggested, seems to be the best.


1. It even does not disable possibility to record timestamp inside
driver, which Alan was afraid of. The sequence is:

	if (!skb->tstamp.off_sec)
		net_timestamp(skb);

2. Maybe, netif_rx() should continue to get timestamp in netif_rx().

3. NAPI already introduced almost the same inaccuracy. And it is really
silly to waste time getting timestamp in netif_receive_skb() a few
moments before the packet is delivered to a socket.

4. ...but clock source, which takes one of top lines in profiles
must be repaired yet. :-)

Alexey
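
The idea discussed above (take the timestamp only at delivery, only for
sockets that asked for one, and only if the driver has not already stamped
the packet) can be modelled in user space as follows. This is a sketch under
stated assumptions, not kernel code: struct mock_skb, struct mock_sock,
mock_net_timestamp() and mock_deliver() are hypothetical stand-ins for the
real sk_buff, socket and net_timestamp(), and the per-socket flag is an
invented field.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-ins for the kernel structures named in the thread. */
struct mock_skb  { struct timeval tstamp; };  /* tv_sec == 0: not stamped */
struct mock_sock { int wants_tstamp; };       /* invented per-socket flag */

static unsigned long clock_reads;  /* how often the clock was consulted */

/* Stand-in for net_timestamp(): read the clock and record it in the skb. */
static void mock_net_timestamp(struct mock_skb *skb)
{
    gettimeofday(&skb->tstamp, NULL);
    clock_reads++;
}

/*
 * Deliver a packet to one socket. The clock is read only if this socket
 * requested timestamps and nobody (for example the driver) stamped the
 * packet earlier, mirroring the "if (!tstamp) net_timestamp()" sequence.
 */
void mock_deliver(const struct mock_sock *sk, struct mock_skb *skb)
{
    if (sk->wants_tstamp && skb->tstamp.tv_sec == 0)
        mock_net_timestamp(skb);
}

/* Number of clock reads performed so far. */
unsigned long mock_clock_reads(void)
{
    return clock_reads;
}
```

With this arrangement a forwarded packet that no socket wants stamped never
touches the clock, and a driver-supplied stamp is kept as-is.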

2006-09-18 21:08:56

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Mon, Sep 18, 2006 at 01:27:57PM +0200, Andi Kleen wrote:
> The codebase for timing (and lots of other things) is quite different
> between 32bit and 64bit. You're really surprised it doesn't work if you do such things?
>
It works, and after your remark above, I'm surprised.
Dunno about slow TSC drift though, there was not enough time passed to
detect it, and I hope we will have this problem solved in a better way
before the drift becomes visible :)

> > But the question is, why stock 2.6.18-rc7 could not use TSC on its own?
>
> x86-64 doesn't use the TSC when it deems it to not be reliable, which
> is the case on your system.
>
Could it at least print something so that I know that using TSC was
considered, but rejected?

> > What hardware exactly. Doesn't it affect only CPU? And they are not
> > know to fail before any other components.
>
> All hardware. It's basic physics.

Hm, what other hardware is affected by idle=poll? Does this option wear
out HDDs?
~
:wq
With best regards,
Vladimir Savkin.

2006-09-18 21:18:05

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Mon, Sep 18, 2006 at 06:50:22PM +0200, Andi Kleen wrote:
>
> I suppose in the worst case a sysctl like Vladimir asked for could be added,
> but it would seem somewhat lame.
>
Please think about it this way:
suppose you have a heavily loaded router and some network problem is to
be diagnosed. You run tcpdump and suddenly the router becomes overloaded (by
switching to timestamp-it-all mode), drops OSPF adjacencies etc. Users
are angry, and you can't diagnose anything. But with imprecise
timestamps and maybe even with reordered packets you still have some
traces to analyze.
So, in this particular corner case it's not that lame.

Or maybe patching tcpdump will do better?
~
:wq
With best regards,
Vladimir Savkin.

2006-09-18 21:23:16

by David Miller

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

From: Alexey Kuznetsov <[email protected]>
Date: Tue, 19 Sep 2006 01:03:21 +0400

> 1. It even does not disable possibility to record timestamp inside
> driver, which Alan was afraid of. The sequence is:
>
> if (!skb->tstamp.off_sec)
> net_timestamp(skb);
>
> 2. Maybe, netif_rx() should continue to get timestamp in netif_rx().
>
> 3. NAPI already introduced almost the same inaccuracy. And it is really
> silly to waste time getting timestamp in netif_receive_skb() a few
> moments before the packet is delivered to a socket.
>
> 4. ...but clock source, which takes one of top lines in profiles
> must be repaired yet. :-)

Ok, ok, but don't we have queueing disciplines that need the timestamp
even on ingress?

2006-09-18 21:46:17

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> Ok, ok, but don't we have queueing disciplines that need the timestamp
> even on ingress?

I cannot find any.

ip_queue does. But it is just another user, no different from sockets.

BTW in any case, any user of timestamp who sees 0, because skb was received
before timestamping was enabled, has to calculate timestamp itself right
in the place where Andi suggested. Seems, preparation to the change
makes sense even without the change. :-)

Alexey

2006-09-18 22:00:45

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> Please think about it this way:
> suppose you haave a heavily loaded router and some network problem is to
> be diagnosed. You run tcpdump and suddenly router becomes overloaded (by
> switching to timestamp-it-all mode

I am sorry. I cannot think that way. :-)

Instead of attempts to scare, better resend original report,
where you said how much performance degraded, I cannot find it.

* I do see get_offset_pmtmr() in top lines of profile. That's scary enough.
* I do not understand what the hell dhcp needs timestamps for.
* I will not listen to any suggestions to screw up tcpdump with a sysctl.
Kernel already implements a much better thing than a sysctl.
Do not want timestamps? Fix tcpdump, add an option, submit the
patch to tcpdump maintainers. Not a big deal.

Alexey

2006-09-18 22:03:33

by Vladimir B. Savkin

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Tue, Sep 19, 2006 at 02:00:38AM +0400, Alexey Kuznetsov wrote:
> Hello!
>
> > Please think about it this way:
> > suppose you haave a heavily loaded router and some network problem is to
> > be diagnosed. You run tcpdump and suddenly router becomes overloaded (by
> > switching to timestamp-it-all mode
>
> I am sorry. I cannot think that way. :-)
>
> Instead of attempts to scare, better resend original report,
> where you said how much performance degraded, I cannot find it.
>
> * I do see get_offset_pmtmr() in top lines of profile. That's scary enough.

I had it at the very top line.

> * I do not undestand what the hell dhcp needs timestamps for.
> * I do not listen any suggestions to screw up tcpdump with a sysctl.
> Kernel already implements much better thing then a sysctl.
> Do not want timestamps? Fix tcpdump, add an options, submit the
> patch to tcpdump maintainers. Not a big deal.

OK, point taken.
It's better to patch tcpdump.

>
> Alexey
>
~
:wq
With best regards,
Vladimir Savkin.

2006-09-18 22:09:34

by David Lang

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Tue, 19 Sep 2006, Alexey Kuznetsov wrote:

> Hello!
>
>> Please think about it this way:
>> suppose you haave a heavily loaded router and some network problem is to
>> be diagnosed. You run tcpdump and suddenly router becomes overloaded (by
>> switching to timestamp-it-all mode
>
> I am sorry. I cannot think that way. :-)
>
> Instead of attempts to scare, better resend original report,
> where you said how much performance degraded, I cannot find it.
>
> * I do see get_offset_pmtmr() in top lines of profile. That's scary enough.
> * I do not undestand what the hell dhcp needs timestamps for.
> * I do not listen any suggestions to screw up tcpdump with a sysctl.
> Kernel already implements much better thing then a sysctl.
> Do not want timestamps? Fix tcpdump, add an options, submit the
> patch to tcpdump maintainers. Not a big deal.

if firing up one program (however minor) can cause network performance to drop
by >50% (based on the numbers reported earlier in this thread) that is a
significant problem for sysadmins.

yes tcpdump may be wrong in requesting timestamps (in most cases it probably is,
but in some cases it's doing exactly what the sysadmin wants it to do), but I
don't think that many sysadmins would expect this much of a performance hit.
there should be some way to tell the system to ignore requests for timestamps so
that a badly behaved program cannot cripple the system this way (and preferably
something that doesn't require a full SELinux/capabilities implementation)

David Lang

2006-09-19 05:55:51

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Monday 18 September 2006 23:03, Alexey Kuznetsov wrote:

>
> > And do you have some other prefered way to solve this? Even if the timer
> > was fast it would be still good to avoid it in the fast path when DHCPD
> > is running.
>
> No. The way, which you suggested, seems to be the best.

Ok. I also checked my desktop and for some reason I got a timestamp counter
of 7 (and it doesn't even run client dhcp). Haven't investigated why yet, and I am
still hoping it's not a leak.

But that hints that trying to fix all of user space to not use the ioctl
would have been probably too much work.


> 1. It even does not disable possibility to record timestamp inside
> driver, which Alan was afraid of. The sequence is:
>
> if (!skb->tstamp.off_sec)
> net_timestamp(skb);
>
> 2. Maybe, netif_rx() should continue to get timestamp in netif_rx().

Hmm, there are still quite a lot of users and even with netif_rx() you
can have long delays from interrupt mitigation etc.

% grep -rw netif_rx drivers/net/* | wc -l
253

> 3. NAPI already introduced almost the same inaccuracy. And it is really
> silly to waste time getting timestamp in netif_receive_skb() a few
> moments before the packet is delivered to a socket.
>
> 4. ...but clock source, which takes one of top lines in profiles
> must be repaired yet. :-)

It's being worked on, but it'll take some time. But even when TSC
can be used it's still a good idea to not call gtod unnecessarily
because it can be still relatively slow (e.g. on P4 RDTSC takes
hundreds of cycles because it synchronizes the CPU). Also on some
other non x86 platforms it is also relatively slow because they have
to reach out to the chipset and every time you do that things get slow.

-Andi

2006-09-19 05:55:54

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Monday 18 September 2006 23:22, David Miller wrote:

> Ok, ok, but don't we have queueing disciplines that need the timestamp
> even on ingress?

I grepped and I can't find any. The only non SIOCGSTAMP users of the
time stamp seem to be sunrpc and conntrack and I bet both can be converted
over to jiffies without trouble.

-Andi

2006-09-19 19:40:31

by David Miller

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

From: David Lang <[email protected]>
Date: Mon, 18 Sep 2006 14:57:04 -0700 (PDT)

> yes tcpdump may be wrong in requesting timestamps (in most cases it
> probably is, but in some cases it's doing exactly what the sysadmin
> wants it to do), but I don't think that many sysadmins would expect
> this much of a performance hit. there should be some way to tell
> the system to ignore requests for timestamps so that a badly behaved
> program cannot cripple the system this way (and preferably something
> that doesn't require a full SELinux/capabilities implementation)

tcpdump is not wrong in requesting timestamps, and there are
many legitimate userland programs, such as X11 clients, that call
gettimeofday() for internal timestamping _A LOT_.

The real fact of the matter is that these x86_64 systems are using the
slowest possible time-of-day implementation, simply because it's "too
hard" currently to properly probe the most efficient mechanism which
is present in the system.

If getting the time of day is at the top of the profiles in the packet
input path, and we're only capturing a timestamp once per packet,
something is _VERY VERY_ wrong with the timestamp implementation.
Think of all of the other seriously expensive things that
happen on a per-packet basis, which should absolutely dwarf
timestamping in terms of cost.

2006-09-19 19:41:27

by David Miller

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

From: "Vladimir B. Savkin" <[email protected]>
Date: Tue, 19 Sep 2006 02:03:31 +0400

> On Tue, Sep 19, 2006 at 02:00:38AM +0400, Alexey Kuznetsov wrote:
> > * I do see get_offset_pmtmr() in top lines of profile. That's scary enough.
>
> I had it at the very top line.

That is just ridiculous.

You can "fix" the networking by making it timestamp less but what
about things like just normal X11 clients that call gettimeofday()
at very high rates?

2006-09-19 19:46:38

by Stephen Hemminger

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

The sky2 hardware (and others) can timestamp in hardware, but trying
to keep device ticks and system clock in sync looked too nasty
to contemplate actually using it.

2006-09-19 19:47:47

by David Miller

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

From: Alexey Kuznetsov <[email protected]>
Date: Tue, 19 Sep 2006 02:00:38 +0400

> * I do not understand what the hell dhcp needs timestamps for.

I can't even find a reference to SIOCGSTAMP in the
dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu.

But I will note that tpacket_rcv() expects to always get
valid timestamps in the SKB, it does a:

	if (skb->tstamp.off_sec == 0) {
		__net_timestamp(skb);
		sock_enable_timestamp(sk);
	}

so that it can fill in the h->tp_sec and h->tp_usec
fields.

2006-09-19 20:30:46

by Thomas Graf

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

* David Miller <[email protected]> 2006-09-18 14:22
> From: Alexey Kuznetsov <[email protected]>
> Date: Tue, 19 Sep 2006 01:03:21 +0400
>
> > 1. It even does not disable possibility to record timestamp inside
> > driver, which Alan was afraid of. The sequence is:
> >
> > 	if (!skb->tstamp.off_sec)
> > 		net_timestamp(skb);
> >
> > 2. Maybe, netif_rx() should continue to get timestamp in netif_rx().
> >
> > 3. NAPI already introduced almost the same inaccuracy. And it is really
> > silly to waste time getting timestamp in netif_receive_skb() a few
> > moments before the packet is delivered to a socket.
> >
> > 4. ...but clock source, which takes one of top lines in profiles
> > must be repaired yet. :-)
>
> Ok, ok, but don't we have queueing disciplines that need the timestamp
> even on ingress?

Queueing disciplines generally only care about the time delta
between two packets, using the receive stamp would lead to
wrong results as soon as a packet is queued more than once.

However, since we recently introduced ingress queueing we
must update the stamp to make up for the delay caused by the
queue. Updating the stamp at socket enqueue time would solve
this automatically.
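
To make the "stamp at socket enqueue time" idea concrete, here is a rough
userspace sketch. It is not kernel code: struct pkt and struct sock_model are
hypothetical stand-ins for struct sk_buff and struct sock, gettimeofday()
stands in for the kernel clock read, and the function name is invented for
illustration. It follows the quoted "if (!skb->tstamp.off_sec)" check, so a
stamp already set by a driver is kept rather than refreshed; whether the stamp
should instead always be updated after ingress queueing is a separate policy
choice that this sketch does not settle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for struct sk_buff; tv_sec == 0 means "unstamped". */
struct pkt {
	struct timeval tstamp;
};

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	bool wants_tstamp;		/* did this socket ask for timestamps? */
	struct timeval last_stamp;	/* stamp of the last queued packet */
};

/*
 * Called when a packet is queued to a socket. The clock is read only if
 * this socket wants timestamps and nothing stamped the packet earlier.
 * Returns true if the packet carries a timestamp afterwards.
 */
bool sock_enqueue_stamp(struct sock_model *sk, struct pkt *p)
{
	if (!sk->wants_tstamp)
		return false;		/* no clock read at all */

	if (p->tstamp.tv_sec == 0)	/* not stamped by a driver */
		gettimeofday(&p->tstamp, NULL);

	sk->last_stamp = p->tstamp;
	return true;
}
```

With this shape, a pure router that never delivers to a timestamping socket
never touches the clock, which is the cost being argued about here.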

It seems only natural to me that the real problem is the slow
clock source which needs to be resolved regardless of the
outcome of this discussion. I believe that updating the stamp
at socket enqueue time is the right thing to do but it shouldn't
be considered as a solution to the performance problem.

2006-09-19 20:43:46

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20


> It seems only natural to me that the real problem is the slow
> clock source which needs to be resolved regardless of the
> outcome of this discussion. I believe that updating the stamp
> at socket enqueue time is the right thing to do but it shouldn't
> be considered as a solution to the performance problem.

While I agree it would be nice to fix that particular issue
(it's unfortunately hard), slow clock sources in general won't go
away. They also exist on lots of other platforms.

And even if you have a fast clock source, not using it when you
don't need to is better. For example some x86s can be quite
slow even reading TSCs. It's much better than pmtmr, but
it's still an expensive operation that is best avoided.

-Andi

2006-09-22 15:36:34

by Alexey Kuznetsov

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

Hello!

> I can't even find a reference to SIOCGSTAMP in the
> dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu.
>
> But I will note that tpacket_rcv() expects to always get
> valid timestamps in the SKB, it does a:

It is equally unlikely it uses mmapped packet socket (tpacket_rcv).

I even installed that dhcp on x86_64. And I do not see anything,
netstamp_needed remains zero when running both server and client.
It looks like dhcp was defamed without a guilt. :-)

It seems Andi saw some leakage in netstamp_needed (value of 7),
but I do not see this either.
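
For readers following along, the netstamp_needed logic under discussion can be
modelled in userspace roughly as below. This is a simplified sketch under
stated assumptions, not the kernel source: struct pkt is a hypothetical
stand-in for struct sk_buff, the counter is a plain int rather than an atomic,
and gettimeofday() stands in for the kernel's clock read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/time.h>

/* Model of the global count of sockets that asked for timestamps. */
static int netstamp_needed;

/* Hypothetical stand-in for struct sk_buff; tv_sec == 0 means "unstamped". */
struct pkt {
	struct timeval tstamp;
};

/* Called when a socket starts wanting timestamps. */
void net_enable_timestamp(void)
{
	netstamp_needed++;
}

/* Called when such a socket goes away. */
void net_disable_timestamp(void)
{
	netstamp_needed--;
}

/*
 * Receive-path hook: read the (possibly slow) clock only while at least
 * one user is registered, otherwise mark the packet as unstamped.
 * Returns true if the clock was read.
 */
bool net_timestamp(struct pkt *p)
{
	if (netstamp_needed > 0) {
		gettimeofday(&p->tstamp, NULL);
		return true;
	}
	p->tstamp.tv_sec = 0;
	p->tstamp.tv_usec = 0;
	return false;
}
```

A "leak" in this model is simply a net_enable_timestamp() with no matching
net_disable_timestamp(): the counter never returns to zero and every packet
keeps paying for a clock read.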


In any case, the issue is obviously more serious than just behaviour
of some applications. On my notebook one gettimeofday() takes:

0.2 us with tsc
4.6 us with pm (AND THIS CRAP IS DEFAULT!!)
9.4 us with pit (kinda expected)

It is ridiculous. Obviously, nobody (not only tcpdump, but everything
else) needs such a clock. Taking a timestamp takes time comparable
to processing a whole TCP frame. :-) I have no idea what is possible
to do without breaking everything, but it is not something to ignore.
This timer must be shot. :-)

Alexey

2006-09-22 15:44:14

by Andi Kleen

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Friday 22 September 2006 17:35, Alexey Kuznetsov wrote:
> Hello!
>
> > I can't even find a reference to SIOCGSTAMP in the
> > dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu.
> >
> > But I will note that tpacket_rcv() expects to always get
> > valid timestamps in the SKB, it does a:
>
> It is equally unlikely it uses mmapped packet socket (tpacket_rcv).
>
> I even installed that dhcp on x86_64. And I do not see anything,
> netstamp_needed remains zero when running both server and client.
> It looks like dhcp was defamed without a guilt. :-)
>
> It seems Andi saw some leakage in netstamp_needed (value of 7),
> but I do not see this either.

That came from named. It opens lots of sockets with SIOCGSTAMP.
No idea what it needs that many for.
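
For reference, this is roughly how a userspace program ends up asking for
receive timestamps on a socket. It is a sketch only: the two helper names are
invented here, and nothing below is taken from the named or dhcpd sources. As
far as I understand it, either the SO_TIMESTAMP socket option or a
SIOCGSTAMP ioctl is enough to make the kernel start stamping incoming packets.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <linux/sockios.h>	/* SIOCGSTAMP */

/*
 * Ask the kernel to attach a receive timestamp to every packet delivered
 * to this socket. Returns 0 on success, -1 on error (errno is set).
 */
int enable_rx_timestamps(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));
}

/*
 * Fetch the timestamp of the last packet passed to the user on this
 * socket. Returns 0 on success, -1 on error (errno is set).
 */
int last_packet_stamp(int fd, struct timeval *tv)
{
	return ioctl(fd, SIOCGSTAMP, tv);
}
```

A daemon that does this on one socket per local address would account for a
counter well above one without anything having leaked.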

I suspect it was either dhcpd (server) or that ppp user space daemon
the original reporter was running.

Maybe it would be a good idea to add a printk by default?

> In any case, the issue is obviously more serious than just behaviour
> of some applications. On my notebook one gettimeofday() takes:
>
> 0.2 us with tsc
> 4.6 us with pm (AND THIS CRAP IS DEFAULT!!)

This is actually quite fast. I've seen much worse ratios.

Also on some i386 kernels the pmtimer reads the register three
times to work around buggy implementations that don't latch the counter
properly.

> 9.4 us with pit (kinda expected)
>
> It is ridiculous. Obviously, nobody (not only tcpdump, but everything
> else) needs such a clock. Taking a timestamp takes time comparable
> with processing the whole tcp frame. :-) I have no idea what is possible
> to do without breaking everything, but it is not something to ignore.
> This timer must be shot. :-)

If it's a reasonably new notebook it might be actually possible to change.
The default choices are quite conservative there because in the past
there were lots of problems with notebooks changing frequency behind
the kernel's back etc. and screwing up TSC. But that shouldn't happen anymore.

If you had a 64bit laptop the kernel would likely make the right choice :)

Notebooks are easy because they are only single socket, so the only thing
needed is to keep track of the frequency (or not if you have a Core+)

-Andi

2006-09-22 16:51:14

by Rick Jones

Subject: Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

> That came from named. It opens lots of sockets with SIOCGSTAMP.
> No idea what it needs that many for.

IIRC ISC BIND named opens a socket for each IP it finds on the system.
Presumably in this way it "knows" implicitly the destination IP without
using platform-specific recvfrom/whatever extensions and gets some
additional parallelism in the stack on SMP systems.

Why it needs/wants the timestamps I've no idea; I don't think it gets
them that way on all platforms. I suppose the next time I do some named
benchmarking I can try to take a closer look in the source.

rick jones