2008-06-02 04:51:23

by Glen Turner

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32


> Yes, every individual Linux network administrator can re-create the
> wheel by devising their own scripts, but it makes much more sense
> to me to implement a simple general kernel mechanism once that could
> be used generically, than to have hundreds (or thousands) of Linux
> network administrators each having to do it themselves (perhaps
> multiple times if they have a variety of types of systems and types
> of NICs).

Hi Bill,

If you pull the stats using an SNMP polling tool (torrus, cacti, mrtg)
then those packages' graphs give nice "did this get better or worse"
output for debugging network issues.

I'd suggest you use one of those tools rather than writing your
own scripts. Even if 99% of the time the graphs record zero errors,
knowing when those errors started is very valuable and well worth
the additional effort of configuring the tools over a command-line
or a kernel hack.

The more sophisticated tools can do alerting to Nagios should
a variable suddenly change its behaviour.

The Cisco/Juniper/everyone-else feature to run console stats
separately from SNMP stats is nice, but it's rather tuned to
the needs of router-heads and tends to fall apart when multiple
staff are debugging a fault.

If we do proceed with better command line stats then the number
of errored seconds and the worst errored second and its value
would be useful. These useful numbers can't be calculated by
the SNMP polling tools and it's hard to see how they could be
done in user-space.
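
As a rough illustration of what those two numbers mean, here is a minimal
sketch. It assumes something is able to read a cumulative error counter
once per second, which coarse SNMP polling does not do; the function name
and the sample readings are made up, and counter wrap is ignored.

```python
def errored_seconds(samples):
    """Summarise cumulative error-counter readings taken one second apart.

    Returns (number_of_errored_seconds, worst), where worst is a tuple
    (second_index, errors_in_that_second), or None if no second had errors.
    Assumes the counter never wraps or resets between readings.
    """
    # Errors that occurred in each one-second interval.
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    errored = [(i, d) for i, d in enumerate(deltas) if d > 0]
    worst = max(errored, key=lambda item: item[1]) if errored else None
    return len(errored), worst


# Hypothetical readings of one error counter, one per second.
readings = [0, 0, 3, 3, 10, 10, 11]
count, worst = errored_seconds(readings)
print("errored seconds:", count)
print("worst errored second (index, errors):", worst)
```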

Cheers, Glen

--
Glen Turner


2008-06-02 16:10:40

by Bill Fink

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32

On Mon, 02 Jun 2008, Glen Turner wrote:

>
> > Yes, every individual Linux network administrator can re-create the
> > wheel by devising their own scripts, but it makes much more sense
> > to me to implement a simple general kernel mechanism once that could
> > be used generically, than to have hundreds (or thousands) of Linux
> > network administrators each having to do it themselves (perhaps
> > multiple times if they have a variety of types of systems and types
> > of NICs).
>
> Hi Bill,
>
> If you pull the stats using an SNMP polling tool (torrus, cacti, mrtg)
> then those packages' graphs give nice "did this get better or worse"
> output for debugging network issues.

I do use mrtg for network monitoring to determine when things go
bad, but when they do go bad, then I typically need to get much
more detailed info when troubleshooting the problem.

> I'd suggest you use one of those tools rather than writing your
> own scripts. Even if 99% of the time the graphs record zero errors,
> knowing when those errors started is very valuable and well worth
> the additional effort of configuring the tools over a command-line
> or a kernel hack.

First of all, when assisting a user, they typically aren't even
running an snmp daemon (and there might be firewall issues to
access it if they are). And I don't think the "ethtool -S" driver
stats are even accessible via SNMP (although they may contribute
to more generic interface stats which are), and it is the specific
driver stats which are often key to diagnosing the problem.

> The more sophisticated tools can do alerting to Nagios should
> a variable suddenly change its behaviour.

Definitely useful for certain arenas.

> The Cisco/Juniper/everyone-else feature to run console stats
> separately from SNMP stats is nice, but it's rather tuned to
> the needs of router-heads and tends to fall apart when multiple
> staff are debugging a fault.

I use it all the time in coordination with network peers and
joint troubleshooting. They clear the interface stats, and they
and I can then view the interface stats as a test is run (they
give me RO access to view the stats), or vice versa depending
on whose network is being examined.

> If we do proceed with better command line stats then the number
> of errored seconds and the worst errored second and its value
> would be useful. These useful numbers can't be calculated by
> the SNMP polling tools and it's hard to see how they could be
> done in user-space.

I'm all for any improved debugging/diagnostic capabilities, including
the extremely useful ability to clear/snapshot driver stats (there
could also be an option to un-snapshot if you wanted to get back to
seeing the absolute counter values).

-Bill

2008-06-03 12:27:19

by James Cammarata

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32

> First of all, when assisting a user, they typically aren't even
> running an snmp daemon (and there might be firewall issues to
> access it if they are). And I don't think the "ethtool -S" driver
> stats are even accessible via SNMP (although they may contribute
> to more generic interface stats which are), and it is the specific
> driver stats which are often key to diagnosing the problem.

Ok, just to jump back in and add my $0.02, I plan to send a patch to
the ethtool project to do what was suggested (snapshot to /var, and
diff stats against that for future reference). I would just like to
point out that this will introduce inconsistencies between ethtool and
_every_ other source of interface stats (ifconfig/procfs/etc). I can deal
with that, it just means one more thing to keep track of when you're
troubleshooting.
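
For illustration only, the snapshot-and-diff idea could look roughly like
the sketch below. It is not the planned ethtool patch: the function names,
the JSON snapshot format, and the sample counters are assumptions, and a
temporary directory stands in for a location under /var. The parser only
relies on the "name: value" lines that "ethtool -S" prints.

```python
import json
import os
import tempfile


def parse_stats(text):
    """Parse 'name: value' lines as printed by `ethtool -S` into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # The "NIC statistics:" header has no numeric value and is skipped.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def save_snapshot(stats, path):
    with open(path, "w") as f:
        json.dump(stats, f)


def load_snapshot(path):
    with open(path) as f:
        return json.load(f)


def diff_stats(current, base):
    """Counters relative to the snapshot; the kernel values are untouched."""
    return {name: value - base.get(name, 0) for name, value in current.items()}


# Made-up sample output for two points in time.
SAMPLE_BEFORE = """NIC statistics:
     rx_packets: 1000
     rx_crc_errors: 2
"""
SAMPLE_AFTER = """NIC statistics:
     rx_packets: 1500
     rx_crc_errors: 5
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eth0.json")  # stand-in for a file under /var
    save_snapshot(parse_stats(SAMPLE_BEFORE), path)
    delta = diff_stats(parse_stats(SAMPLE_AFTER), load_snapshot(path))

print(delta)
```

Because only the tool's own view changes, ifconfig and procfs would keep
reporting the absolute values, which is the inconsistency noted above.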

Also, I'd like to point out that the "stepping on other people's toes"
argument is a bogus one, since most major network vendors include this
ability, and it's not exactly a concern now.

2008-06-03 12:35:43

by Ben Hutchings

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32

James Cammarata wrote:
> >First of all, when assisting a user, they typically aren't even
> >running an snmp daemon (and there might be firewall issues to
> >access it if they are). And I don't think the "ethtool -S" driver
> >stats are even accessible via SNMP (although they may contribute
> >to more generic interface stats which are), and it is the specific
> >driver stats which are often key to diagnosing the problem.
>
> Ok, just to jump back in and add my $0.02, I plan to send a patch to
> the ethtool project to do what was suggested (snapshot to /var, and
> diff stats against that for future reference). I would just like to
> point out that this will introduce inconsistencies between ethtool and
> _every_ other source of interface stats (ifconfig/procfs/etc). I can deal
> with that, it just means one more thing to keep track of when you're
> troubleshooting.

You could perhaps add a third column to the statistics when a snapshot
exists, with the second column still showing absolute values and the
third showing the difference from the snapshot.
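
A minimal sketch of that layout, assuming the current counters and the
snapshot are already available as dictionaries (the function name, column
widths, and sample values are illustrative, not ethtool's actual output
format):

```python
def format_stats(current, snapshot=None):
    """Return one line per counter: name, absolute value, and, only when a
    snapshot exists, the signed difference from that snapshot."""
    width = max((len(name) for name in current), default=0)
    lines = []
    for name, value in current.items():
        line = f"{name:<{width}}  {value:>12}"
        if snapshot is not None:
            # Counters missing from the snapshot are treated as starting at 0.
            line += f"  {value - snapshot.get(name, 0):>+12}"
        lines.append(line)
    return lines


# Made-up counter values.
current = {"rx_errors": 17, "tx_packets": 900}
snapshot = {"rx_errors": 15, "tx_packets": 400}

print("\n".join(format_stats(current)))            # two columns, no snapshot
print("\n".join(format_stats(current, snapshot)))  # three columns
```

Keeping the absolute value in the second column means the output still
agrees with the other sources of interface stats.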

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

2008-06-03 15:06:21

by Alan

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32

> Also, I'd like to point out that the "stepping on other people's toes"
> argument is a bogus one, since most major network vendors include this
> ability, and it's not exactly a concern now.

I used to work in a large ISP - it was a huge concern then and was
enforced and managed by the less effective 'do you like your kneecaps'
approach to permissions.

It's basically impossible to write a correct non-racy application which
zeros kernel statistics and then measures the change, because you cannot
know another application did the same while you were running.
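
A toy illustration of that race, using a simulated counter rather than real
kernel statistics (the class and function names are invented for the
example, and the interleaving is fixed by hand): two observers that each
zero the counter interfere with one another, while two observers that each
keep a private baseline do not.

```python
class Counter:
    """Stand-in for one shared kernel statistic."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

    def read(self):
        return self.value

    def clear(self):
        self.value = 0


def zeroing_observers():
    c = Counter()
    c.add(100)       # traffic before anyone is measuring
    c.clear()        # observer A zeros the counter and starts measuring
    c.add(5)
    c.clear()        # observer B does the same, silently discarding A's 5
    c.add(3)
    # A expects 8 events since it started but reads 3; A cannot tell.
    return c.read(), c.read()


def snapshot_observers():
    c = Counter()
    c.add(100)
    a_base = c.read()    # observer A records a private baseline
    c.add(5)
    b_base = c.read()    # observer B records its own baseline
    c.add(3)
    # Each observer subtracts its own baseline; neither disturbs the other.
    return c.read() - a_base, c.read() - b_base


print("zeroing  (A, B):", zeroing_observers())
print("snapshot (A, B):", snapshot_observers())
```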

This is the most basic and blindingly obvious stuff. You should not be
able to zero the kernel stats just because you can't work perl.

Alan

2008-06-04 03:03:45

by James Cammarata

Subject: Re: [PATCH] net: add ability to clear stats via ethtool - e1000/pcnet32

> I used to work in a large ISP - it was a huge concern then and was
> enforced and managed by the less effective 'do you like your kneecaps'
> approach to permissions.

I work at a large ISP now, and you're absolutely right. You don't just go
around resetting interface counters on backbone routers for the hell of it,
and we never do it without customer permission while troubleshooting an
issue with a connection. That is why I said I thought it was a non-argument.
There seems to be an irrational fear of counter-based anarchy here.

> It's basically impossible to write a correct non-racy application which
> zeros kernel statistics and then measures the change, because you cannot
> know another application did the same while you were running.
>
> This is the most basic and blindingly obvious stuff. You should not be
> able to zero the kernel stats just because you can't work perl.

I've already said I'd drop the issue 4+ days ago, and that I'd be more
than happy to do it in userland as you suggested. My point was simply that
adding it to only one userland tool will lead to inconsistencies. It is not
an issue of being able to "work perl" or not.