2009-01-21 18:02:08

by Chuck Lever III

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Hey Steve-

On Jan 21, 2009, at 12:13 PM, Steve Dickson wrote:
> Sorry for the delayed response... That darn flux capacitor broke
> again! ;-)
>
>
> Chuck Lever wrote:
>>
>> I'm all for improving the observability of the NFS client.
> Well, in theory, trace points will also touch the server and all
> of the rpc code...
>
>>
>> But I don't (yet) see the advantage of adding this complexity in the
>> mount path. Maybe the more complex and asynchronous parts of the NFS
>> client, like the cached read and write paths, are more suitable to
>> this
>> type of tool.
> Well the complexity is, at this point, due to how the trace points
> are tied to and used by the systemtap. I'm hopeful this complexity
> will die down as time goes on...

I understand that your proposed mount path changes were attempting to
provide a simple example of using trace points that could be applied
to the NFS client and server in general.

However, I'm interested mostly in improving how the mount path
specifically reports problems. I'm not convinced that trace points (or
our current dprintks, for that matter) are a useful approach to solving
NFS mount issues in particular.

But that introduces the general question of whether trace points,
dprintk, network tracing, or something else is the most appropriate
tool to address the most common troubleshooting problems in any
particular area of the NFS client or server. I'd also like some
clarity on what our problem statement is here. What problems are we
trying to address?

>> Why can't we simply improve the information content of the dprintks?
> The theory is trace point can be turned on, in production kernels,
> with
> little or no performance issues...

mount isn't a performance path, which is one reason I think trace
points might be overkill for this case.

>> Can you give a few real examples of problems that these new trace
>> points
>> can identify that better dprintks wouldn't be able to address?
> They can supply more information that can be used by both a kernel
> guy and an IT guy.... Meaning they can supply detailed structure
> information
> that a kernel guy would need as well as supplying the simple error
> code
> that an IT guy would be interested.

My point is, does that flexibility really help some poor admin who is
trying to diagnose a mount problem? Is it going to reduce the number
of calls to your support desk?

I'd like to see an example of a real mount problem or two that dprintk
isn't adequate for, but a trace point could have helped. In other
words, can we get some use cases for dprintk and trace points for
mount problems in specific? I think that would help us understand the
trade-offs a little better.

Some general use cases for trace points might also widen our dialog
about where they are appropriate to use. I'm not at all arguing
against using trace points in general, but I would like to see some
thinking about whether they are the most appropriate tool for each of
the many troubleshooting jobs we have.

>> Generally, what kind of problems do admins face that the dprintks
>> don't
>> handle today, and what are the alternatives to addressing those
>> issues?
> Not being an admin guy, I really don't have an answer for this... but
> I can say since trace point are not so much of a drag on the system as
> printks are.. with in timing issues using trace point would be a big
> advantage
> over printks

I like the idea of not depending on the system log, and that's
appropriate for performance hot paths and asynchronous paths where
timing can be an issue. That's one reason why I created the NFS and
RPC performance metrics facility.

But mount is not a performance path, and is synchronous, more or
less. In addition, mount encounters problems much more frequently
than the read or write path, because mount depends a lot on what
options are selected and the network environment it's running in. It's
the first thing to try contacting the server, as well, so it "shakes
out" a lot of problems before a read or write is even done.

So something like dprintk or trace points or a network trace that has
some setup overhead might be less appropriate for mount than, say,
beefing up the error reporting framework in the mount path, just as an
example.

>> Do admins who run enterprise kernels actually use SystemTap, or do
>> they
>> fall back on network traces and other tried and true troubleshooting
>> methodologies?
> Currently to run systemtap, one need kernel debug info and kernel
> developer
> info installed on the system. Most productions system don't install
> those types
> of packages.... But with trace points those type of packages will no
> longer be
> needed, so I could definitely see admins using systemtap once its
> available...
> Look at Dtrace... people are using that now that its available and
> fairly stable.
>
>> If we think the mount path needs such instrumentation, consider
>> updating
>> fs/nfs/mount_clnt.c and net/sunrpc/rpcb_clnt.c as well.
>>
> I was just following what was currently being debugged when
> 'rpcinfo -m nfs -s mount' was set...

`rpcdebug -m nfs -s mount` also enables the dprintks in
fs/nfs/mount_clnt.c, at least. As with most dprintk infrastructure in NFS,
it's really aimed at developers and not end users or admins. The
rpcbind client is also an integral part of the mount process, so I
suggested that too.
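
(For reference, the instrumentation that switch enables looks roughly
like the following. This is an illustrative sketch, not the literal
mount_clnt.c code, and the message text is made up.)

/* Illustrative sketch of the existing dprintk pattern; not the actual
 * fs/nfs/mount_clnt.c source. */
#include <linux/nfs_fs.h>

/* Each NFS source file names the debug facility its dprintks belong
 * to; the mount client uses the MOUNT facility bit. */
#define NFSDBG_FACILITY         NFSDBG_MOUNT

static void example_mnt_debug(const char *hostname, int status)
{
        /* Printed to the system log only when the NFSDBG_MOUNT bit is
         * set in the nfs_debug flags (which is what rpcdebug toggles);
         * otherwise the statement is skipped entirely. */
        dprintk("NFS: MNT request to %s returned status %d\n",
                hostname, status);
}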

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


2009-01-23 18:28:11

by Chuck Lever III

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

On Jan 22, 2009, at 10:19 AM, Steve Dickson wrote:
> Greg Banks wrote:
>> I think both dprintks and trace points are the wrong approach for
>> client-side mount problems. What you really want there is good and
>> useful diagnostic information going unconditionally via printk().
>> Mount
>> problems happen frequently enough, and are often not the client's
>> fault
>> but the server's or a firewall's, that system admins need to be
>> able to
>> work out what went wrong in retrospect by looking in syslog.
>>
>> But just because Steve chose an unfortunate example doesn't
>> invalidate
>> his point. There are plenty of gnarly logic paths in the NFS
>> client and
>> server which need better runtime diagnostics. On the server,
>> anything
>> involving an upcall to userspace . On the client, silly rename or
>> attribute caching.
> It appears I did pick an "unfortunate example"... since I was really
> trying to introduce trace points to see how they could be used...
> Maybe picking the I/O path would have been better...

Choosing mount was reasonable, as it's simple. The discussion we are
having about what tool is right for the job would have probably been
less interesting if you had stuck with the I/O path.

The big picture, though, is what we need to do to make it easier to
troubleshoot and solve problems. That is a much bigger question than
how we report errors.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-01-23 22:22:42

by Greg Banks

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Chuck Lever wrote:
>
> The big picture though, is what do we need to do to make it easier to
> troubleshoot and solve problems. That is a much bigger question than
> how we report errors.
Indeed.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-21 19:29:00

by Trond Myklebust

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

On Wed, 2009-01-21 at 13:01 -0500, Chuck Lever wrote:
> `rpcdebug -m nfs -s mount` also enables the dprintks in fs/nfs/
> mount_clnt.c, at least. As with most dprintk infrastructure in NFS,
> it's really aimed at developers and not end users or admins. The
> rpcbind client is also an integral part of the mount process, so I
> suggested that too.

This would be my main gripe with suggestions that we convert all the
existing dprintks. As Chuck says, they are pretty much a hodgepodge of
messages designed to help kernel developers to debug the NFS and RPC
code.

If you want something dtrace-like to allow administrators to run scripts
to monitor the health of their cluster and troubleshoot performance
problems, then you really want to start afresh. That really needs to be
designed as a long-term API, and should ideally represent the desired
functionality in a manner that is more or less independent of the
underlying code (something that is clearly not the case for the current
mess of dprintks). Otherwise, scripts will have to be rewritten every
time we make some minor tweak or change to the code (i.e. for every
kernel release).

Cheers
Trond

2009-01-21 19:37:44

by Steve Dickson

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Chuck Lever wrote:
> Hey Steve-
>
> On Jan 21, 2009, at 12:13 PM, Steve Dickson wrote:
>> Sorry for the delayed response... That darn flux capacitor broke
>> again! ;-)
>>
>>
>> Chuck Lever wrote:
>>>
>>> I'm all for improving the observability of the NFS client.
>> Well, in theory, trace points will also touch the server and all
>> of the rpc code...
>>
>>>
>>> But I don't (yet) see the advantage of adding this complexity in the
>>> mount path. Maybe the more complex and asynchronous parts of the NFS
>>> client, like the cached read and write paths, are more suitable to this
>>> type of tool.
>> Well the complexity is, at this point, due to how the trace points
>> are tied to and used by the systemtap. I'm hopeful this complexity
>> will die down as time goes on...
>
> I understand that your proposed mount path changes were attempting to
> provide a simple example of using trace points that could be applied to
> the NFS client and server in general.
Very true... It's definitely just a template... If/when we agree to a
format of the template, I would like to simply clone it through the
rest of the code.

> However I'm interested mostly in improving how the mount path in
> specific reports problems. I'm not convinced that trace points (or our
> current dprintk, for that matter) is a useful approach to solving NFS
> mount issues, in specific.
>
> But that introduces the general question of whether trace points,
> dprintk, network tracing, or something else is the most appropriate tool
> to address the most common troubleshooting problems in any particular
> area of the NFS client or server. I'd also like some clarity on what
> our problem statement is here. What problems are we trying to address?
The problem I'm trying to address is allowing admins to debug (or decipher)
NFS problems on production systems in a very non-intrusive way. Meaning
having no ill effects on performance or stability when the trace points
are enabled.
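
To make that concrete, here is a purely hypothetical sketch of the
kind of trace point I have in mind. The event name, fields and message
are invented for illustration (they are not taken from the posted
patches), and it uses the TRACE_EVENT() form of the tracing
infrastructure:

/* Hypothetical example only: names and fields are invented. */
#include <linux/tracepoint.h>

TRACE_EVENT(nfs_mount_result,

        TP_PROTO(const char *hostname, int error),

        TP_ARGS(hostname, error),

        /* Detailed data for the kernel developer... */
        TP_STRUCT__entry(
                __string(hostname, hostname)
                __field(int, error)
        ),

        TP_fast_assign(
                __assign_str(hostname, hostname);
                __entry->error = error;
        ),

        /* ...and a simple, admin-readable line when the event fires. */
        TP_printk("server=%s error=%d",
                  __get_str(hostname), __entry->error)
);

The mount path would then just call trace_nfs_mount_result(hostname,
error); when nothing is listening the call site costs next to nothing,
and a SystemTap script can attach to it without the kernel debuginfo
packages that systemtap otherwise needs.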

>
>>> Why can't we simply improve the information content of the dprintks?
>> The theory is trace point can be turned on, in production kernels, with
>> little or no performance issues...
>
> mount isn't a performance path, which is one reason I think trace points
> might be overkill for this case.
Maybe so, but again, it was one of the easier paths to convert. Would it
be more palatable if I converted the I/O paths?

>
>>> Can you give a few real examples of problems that these new trace points
>>> can identify that better dprintks wouldn't be able to address?
>> They can supply more information that can be used by both a kernel
>> guy and an IT guy.... Meaning they can supply detailed structure
>> information
>> that a kernel guy would need as well as supplying the simple error code
>> that an IT guy would be interested.
>
> My point is, does that flexibility really help some poor admin who is
> trying to diagnose a mount problem? Is it going to reduce the number of
> calls to your support desk?
I think so... Once admins learn what is available and how to use them,
they will be able to file better, more concise bug reports. So maybe
there may not be a decrease in calls, but each caller (potentially) will
supply the support desk with better information.

>
> I'd like to see an example of a real mount problem or two that dprintk
> isn't adequate for, but a trace point could have helped. In other
> words, can we get some use cases for dprintk and trace points for mount
> problems in specific? I think that would help us understand the
> trade-offs a little better.
In the mount path that might be a bit difficult... but with trace
points you would be able to look at the entire super block or the
entire server and client structures, something you can't do with
static/canned printks...

>
> Some general use cases for trace points might also widen our dialog
> about where they are appropriate to use. I'm not at all arguing against
> using trace points in general, but I would like to see some thinking
> about whether they are the most appropriate tool for each of the many
> troubleshooting jobs we have.
The I/O paths jump into my head... since trace points are much less of
a performance killer than printks, the I/O path might be an appropriate
use...

>
>>> Generally, what kind of problems do admins face that the dprintks don't
>>> handle today, and what are the alternatives to addressing those issues?
>> Not being an admin guy, I really don't have an answer for this... but
>> I can say since trace point are not so much of a drag on the system as
>> printks are.. with in timing issues using trace point would be a big
>> advantage
>> over printks
>
> I like the idea of not depending on the system log, and that's
> appropriate for performance hot paths and asynchronous paths where
> timing can be an issue. That's one reason why I created the NFS and RPC
> performance metrics facility.
Which is totally being underutilized... IMHO... I can see a combination
of using both.... Using the metrics to identify a problem and then
using trace points to solve the problem...

>
> But mount is not a performance path, and is synchronous, more or less.
> In addition, mount encounters problems much more frequently than the
> read or write path, because mount depends a lot on what options are
> selected and the network environment its running in. It's the first
> thing to try contacting the server, as well, so it "shakes out" a lot of
> problems before a read or write is even done.
>
> So something like dprintk or trace points or a network trace that have
> some set up overhead might be less appropriate for mount than, say,
> beefing up the error reporting framework in the mount path, just as an
> example.
Trace points by far have much, much less overhead than printks... that's
one of their major advantages...

>
>>> Do admins who run enterprise kernels actually use SystemTap, or do they
>>> fall back on network traces and other tried and true troubleshooting
>>> methodologies?
>> Currently to run systemtap, one need kernel debug info and kernel
>> developer
>> info installed on the system. Most productions system don't install
>> those types
>> of packages.... But with trace points those type of packages will no
>> longer be
>> needed, so I could definitely see admins using systemtap once its
>> available...
>> Look at Dtrace... people are using that now that its available and
>> fairly stable.
>>
>>> If we think the mount path needs such instrumentation, consider updating
>>> fs/nfs/mount_clnt.c and net/sunrpc/rpcb_clnt.c as well.
>>>
>> I was just following what was currently being debugged when
>> 'rpcinfo -m nfs -s mount' was set...
>
> `rpcdebug -m nfs -s mount` also enables the dprintks in
> fs/nfs/mount_clnt.c, at least. As with most dprintk infrastructure in
> NFS, it's really aimed at developers and not end users or admins. The
> rpcbind client is also an integral part of the mount process, so I
> suggested that too.
>
ACK...

steved.

2009-01-21 19:58:04

by Steve Dickson

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Hey,

Trond Myklebust wrote:
> On Wed, 2009-01-21 at 13:01 -0500, Chuck Lever wrote:
>> `rpcdebug -m nfs -s mount` also enables the dprintks in fs/nfs/
>> mount_clnt.c, at least. As with most dprintk infrastructure in NFS,
>> it's really aimed at developers and not end users or admins. The
>> rpcbind client is also an integral part of the mount process, so I
>> suggested that too.
>
> This would be my main gripe with suggestions that we convert all the
> existing dprintks. As Chuck says, they are pretty much a hodgepodge of
> messages designed to help kernel developers to debug the NFS and RPC
> code.
Well as I see it, this is our chance to clean it up...

>
> If you want something dtrace-like to allow administrators to run scripts
> to monitor the health of their cluster and troubleshoot performance
> problems, then you really want to start afresh. That really needs to be
> designed as a long-term API, and should ideally represent the desired
> functionality in a manner that is more or less independent of the
> underlying code (something that is clearly not the case for the current
> mess of dprintks).
I'm not sure how the trace points could be independent of the underlying
code, but I do agree a well-designed API would be optimal.... But before
we go off designing something, I think we need to decide what the end
game is.

Do we want trace points:
1) at all
2) for debugging
3) for performance
4) 2 and 3

Once we get the above nailed down then we can decide how to go...

Also, Greg and Jason Baron (from Red Hat) are off working on
improving the dprintk()s that currently exist... I would
suspect we would want to also tie in with that to see if
it would be applicable...


> Otherwise, scripts will have to be rewritten every
> time we make some minor tweak or change to the code (i.e. for every
> kernel release).
No matter how well we design this, I'm sure there will always be
a need for tweaks in the user-level scripts... but we can always
leave that up to the nfs-utils maintainer.... (Doh!) 8-)

steved.

2009-01-21 20:19:15

by Chuck Lever III

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

On Jan 21, 2009, at 2:37 PM, Steve Dickson wrote:
> Chuck Lever wrote:
>> Hey Steve-
>>
>> I'd like to see an example of a real mount problem or two that
>> dprintk
>> isn't adequate for, but a trace point could have helped. In other
>> words, can we get some use cases for dprintk and trace points for
>> mount
>> problems in specific? I think that would help us understand the
>> trade-offs a little better.
> In the mount path that might be a bit difficult... but with trace
> points you would be able to look at the entire super block or entire
> server and client structures something you can't static/canned
> printks...

I've never ever seen an NFS mount problem that required an admin to
provide information from a superblock. That seems like a lot of
implementation detail that would be meaningless to admins and support
desk folks.

This is why I think we need to have some real world customer examples
of mount problems (or read performance problems, or whatever) that we
want to be able to diagnose in enterprise distributions. I'm not
saying this to throw up a road block... I think we really need to
understand the problem before designing the solution, and so let's
start with some practical examples.

Again, I'm not saying trace points are bad or wrong, just that they
may not be appropriate for a particular code path and the type of
problems that arise during specific NFS operations. I'm not
criticizing your particular sample code. I'm asking "Before we add
trace points everywhere, are trace points strategically the right
debugging tool in every case?"

Basically we have to know well in advance what kind of information
will be needed at each trace point. Who can predict? If you have to
solder in trace points in advance, in some ways that doesn't seem any
more flexible than a dprintk. What you've demonstrated is another
good general tool for debugging, but you haven't convinced me that
this is the right tool for, say, the mount path, or ACL support, and
so on.

>> But mount is not a performance path, and is synchronous, more or
>> less.
>> In addition, mount encounters problems much more frequently than the
>> read or write path, because mount depends a lot on what options are
>> selected and the network environment its running in. It's the first
>> thing to try contacting the server, as well, so it "shakes out" a
>> lot of
>> problems before a read or write is even done.
>>
>> So something like dprintk or trace points or a network trace that
>> have
>> some set up overhead might be less appropriate for mount than, say,
>> beefing up the error reporting framework in the mount path, just as
>> an
>> example.
> Trace points by far have much much less overhead than printks... thats
> one of their major advantages...

Yeah, but that doesn't matter in some cases, like mount, or
asynchronous file deletes, or .... so we have to look at some of the
other issues with using them when deciding if they are the right tool
for the job.

I think we need to visit this issue on a case-by-case basis.
Sometimes dprintk is appropriate. Sometimes printk(KERN_ERR).
Sometimes a performance metric. Having specific troubleshooting in
mind when we design this is critical, otherwise we are going to add a
lot of cruft for no real benefit.

That's an advantage of something like SystemTap. You can specify
whatever is needed for a specific problem, and you don't need to
recompile the kernel to do it. Enterprise distributions can provide
specific scripts for their code base, which doesn't change much.
Upstream is free to make whatever drastic modifications to the code
base without worrying about breaking a kernel-user space API.

Trond has always maintained that dprintk() is best for developers, but
probably inappropriate for field debugging, and I think that may also
apply to trace points. So I'm not against adding trace points where
appropriate, but I'm doubtful that they will be helpful outside of
kernel development; ie I wonder if they will specifically help
customers of enterprise distributions.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2009-01-21 20:23:47

by Trond Myklebust

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

On Wed, 2009-01-21 at 14:58 -0500, Steve Dickson wrote:
> Do we want to trace points:
> 1) at all
> 2) for debugging
> 3) for performance
> 4) 2 and 3
>
> Once we get the above nailed down then we can decide how to go...

I think it might be a good idea to flesh out a bit what you mean by
"debugging" here. Since you mentioned it in conjunction with the two
words "administrators" and "scripts", I assume that you are not talking
about kernel code debugging?

Cheers
Trond

2009-01-21 21:26:02

by Greg Banks

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Chuck Lever wrote:
>>> Why can't we simply improve the information content of the dprintks?
>>>
>> The theory is trace point can be turned on, in production kernels,
>> with
>> little or no performance issues...
>>
>
> mount isn't a performance path,
Perhaps not on the client, but when you have >6000 clients mounting
simultaneously then mount is most definitely a performance path on the
server :-)

> which is one reason I think trace
> points might be overkill for this case.
>

I think both dprintks and trace points are the wrong approach for
client-side mount problems. What you really want there is good and
useful diagnostic information going unconditionally via printk(). Mount
problems happen frequently enough, and are often not the client's fault
but the server's or a firewall's, that system admins need to be able to
work out what went wrong in retrospect by looking in syslog.
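
As a rough sketch of what I mean (the helper and wording here are
invented, not actual kernel code):

/* Sketch only: report a mount failure unconditionally, in terms an
 * admin can act on, so the evidence is in syslog after the fact. */
#include <linux/kernel.h>

static void example_report_mount_failure(const char *hostname,
                                         const char *export, int error)
{
        /* Always logged; no rpcdebug flag or SystemTap session needed. */
        printk(KERN_WARNING
               "NFS: mount of %s:%s failed with error %d "
               "(check the server's mountd and any firewall in between)\n",
               hostname, export, error);
}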

But just because Steve chose an unfortunate example doesn't invalidate
his point. There are plenty of gnarly logic paths in the NFS client and
server which need better runtime diagnostics. On the server, anything
involving an upcall to userspace. On the client, silly rename or
attribute caching.
>
>> Not being an admin guy, I really don't have an answer for this... but
>> I can say since trace point are not so much of a drag on the system as
>> printks are.. with in timing issues using trace point would be a big
>> advantage
>> over printks
>>
>
>
Well that argument works both ways. Several times now I've seen
problems where a significant part of the debugging process has involved
noticing correlations between timing of dprintks and syslog messages
from other subsystems, like IPoIB or TCP. That's harder to do if the
debug statements and printks go through separate mechanisms to userspace.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

2009-01-21 22:36:53

by Greg Banks

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Chuck Lever wrote:
>
>
> I think we need to visit this issue on a case-by-case basis.
> Sometimes dprintk is appropriate. Sometimes printk(KERN_ERR).
> Sometimes a performance metric.
Well said.

> Trond has always maintained that dprintk() is best for developers, but
> probably inappropriate for field debugging,
It's not a perfect tool but it beats nothing at all.
> and I think that may also
> apply to trace points.
It depends on whether distros can be convinced to enable it by default,
and install by default any necessary userspace infrastructure. The
most important thing for field debugging is Just Knowing that you have
all the bits necessary to perform useful debugging without having to
find some RPM that matches the kernel that the machine is actually
running now, and not the one that was present when the machine was
installed.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

2009-01-22 13:07:43

by Steve Dickson

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path



Trond Myklebust wrote:
> On Wed, 2009-01-21 at 14:58 -0500, Steve Dickson wrote:
>> Do we want to trace points:
>> 1) at all
>> 2) for debugging
>> 3) for performance
>> 4) 2 and 3
>>
>> Once we get the above nailed down then we can decide how to go...
>
> I think it might be a good idea to flesh out a bit what you mean by
> "debugging" here. Since you mentioned it in conjunction with the two
> words "administrators" and "scripts", I assume that you are not talking
> about kernel code debugging?
I'm talking about debugging for both admins and kernel people...

With trace points and systemtap you can do both.

steved.

2009-01-22 13:55:24

by Steve Dickson

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Chuck Lever wrote:
> On Jan 21, 2009, at 2:37 PM, Steve Dickson wrote:
>> Chuck Lever wrote:
>>> Hey Steve-
>>>
>>> I'd like to see an example of a real mount problem or two that dprintk
>>> isn't adequate for, but a trace point could have helped. In other
>>> words, can we get some use cases for dprintk and trace points for mount
>>> problems in specific? I think that would help us understand the
>>> trade-offs a little better.
>> In the mount path that might be a bit difficult... but with trace
>> points you would be able to look at the entire super block or entire
>> server and client structures something you can't static/canned
>> printks...
>
> I've never ever seen an NFS mount problem that required an admin to
> provide information from a superblock. That seems like a lot of
> implementation detail that would be meaningless to admins and support
> desk folks.
True... but my point is with trace points and systemtap scripts
one has access to BOTH highly technical data (for the developer)
and simple error codes (for the admins).... Unlike with printks...

>
> This is why I think we need to have some real world customer examples of
> mount problems (or read performance problems, or whatever) that we want
> to be able to diagnose in enterprise distributions. I'm not saying this
> to throw up a road block... I think we really need to understand the
> problem before designing the solution, and so let's start with some
> practical examples.
I'm not sure this is an obtainable goal.... I see it as: we put in a
well-designed infrastructure (something I think Trond is suggesting)
and then let the consumers of the infrastructure tell us what is
needed... Believe me, there are enterprise people that know *exactly*
what they are looking for... ;-)

>
> Again, I'm not saying trace points are bad or wrong, just that they may
> not be appropriate for a particular code path and the type of problems
> that arise during specific NFS operations. I'm not criticizing your
> particular sample code. I'm asking "Before we add trace points
> everywhere, are trace points strategically the right debugging tool in
> every case?"
Good point... but given the fact that trace points have very little
overhead, it's kinda hard to see why they would not be the right
tool... But again I do see your point...

>
> Basically we have to know well in advance what kind of information will
> be needed at each trace point. Who can predict? If you have to solder
> in trace points in advance, in some ways that doesn't seem any more
> flexible than a dprintk. What you've demonstrated is another good
> general tool for debugging, but you haven't convinced me that this is
> the right tool for, say, the mount path, or ACL support, and so on.
No worries.. I'll keep trying! ;-)

To your point, I know for a fact there are customers asking for
trace points in particular areas of the code (not the NFS code atm).
So, again, I think we should take the "build it and they will come"
approach... Meaning, give people something to work with and they
will let us know what they need...

>
> I think we need to visit this issue on a case-by-case basis. Sometimes
> dprintk is appropriate. Sometimes printk(KERN_ERR). Sometimes a
> performance metric. Having specific troubleshooting in mind when we
> design this is critical, otherwise we are going to add a lot of kruft
> for no real benefit.
I can agree with this...

>
> That's an advantage of something like SystemTap. You can specify
> whatever is needed for a specific problem, and you don't need to
> recompile the kernel to do it. Enterprise distributions can provide
> specific scripts for their code base, which doesn't change much.
> Upstream is free to make whatever drastic modifications to the code base
> without worrying about breaking a kernel-user space API.
>
> Trond has always maintained that dprintk() is best for developers, but
> probably inappropriate for field debugging, and I think that may also
> apply to trace points. So I'm not against adding trace points where
> appropriate, but I'm doubtful that they will be helpful outside of
> kernel development; ie I wonder if they will specifically help customers
> of enterprise distributions.
>
Time will tell... I think once customers see how useful and powerful
traces can be, they'll become addicted.... fairly quickly....

steved.

2009-01-22 15:19:13

by Steve Dickson

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path



Greg Banks wrote:
> I think both dprintks and trace points are the wrong approach for
> client-side mount problems. What you really want there is good and
> useful diagnostic information going unconditionally via printk(). Mount
> problems happen frequently enough, and are often not the client's fault
> but the server's or a firewall's, that system admins need to be able to
> work out what went wrong in retrospect by looking in syslog.
>
> But just because Steve chose an unfortunate example doesn't invalidate
> his point. There are plenty of gnarly logic paths in the NFS client and
> server which need better runtime diagnostics. On the server, anything
> involving an upcall to userspace . On the client, silly rename or
> attribute caching.
It appears I did pick an "unfortunate example"... since I was really
trying to introduce trace points to see how they could be used...
Maybe picking the I/O path would have been better...


>>> Not being an admin guy, I really don't have an answer for this... but
>>> I can say since trace point are not so much of a drag on the system as
>>> printks are.. with in timing issues using trace point would be a big
>>> advantage
>>> over printks
>>>
>>
> Well that argument works both ways. Several times now I've seen
> problems where a significant part of the debugging process has involved
> noticing correlations between timing of dprintks and syslog messages
> from other subsystems, like IPoIB or TCP. That's harder to do if the
> debug statements and printks go through separate mechanisms to userspace.

Yes... I have seen this a number of times and places... :-(

steved.

2009-01-22 22:31:52

by Greg Banks

Subject: Re: [RFC][PATCH 0/5] NFS: trace points added to mounting path

Steve Dickson wrote:
>
> True... but my point is with trace points and systemtap scripts
> one has access to BOTH highly technical data (for the developer)
> and simple error codes (for the admins).... Unlike with printks...
>
Yes, there's a lot more power in trace points.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.