2021-04-27 19:10:34

by Rogier Wolff

Subject: Lockd error message is unclear.


Hi,

Two things.....

I got:

lockd: cannot monitor <client>

in the logfile and the client was terribly slow/not working at all.

Everything pointed to a lockd problem...

In the end... it turns out that my rpc.statd stopped working. I had
to go and download the sources to figure this out... I would firstly
suggest to improve the error message to give others running into this
more hints as to where to look.

The error message on line 169 of lockd.c could read:

lockd: Error in the rpc to rpc.statd to monitor %s\n

Would it be an idea to print the res.status error code?
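
To make the suggestion concrete, here is a minimal user-space sketch of the proposed message with the status code appended. The helper name format_monitor_error and the "(status %d)" suffix are assumptions for illustration only; in the kernel this would be a printk() at the existing call site, not a new function.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of lockd): builds the proposed log line.
 * snprintf() stands in for printk() so the sketch compiles in user space.
 * "status" represents the res.status value mentioned above.
 */
int format_monitor_error(char *buf, size_t len, const char *host, int status)
{
    return snprintf(buf, len,
                    "lockd: Error in the rpc to rpc.statd to monitor %s (status %d)\n",
                    host, status);
}
```

Naming rpc.statd in the message points the reader at the right daemon, and the numeric status distinguishes one failure mode from another.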


That said...

When this situation is going on, the client grinds to a halt, and
lockd seems "stuck" in D state. I tried killing or stracing it, to try
to clear the error, before I found out it is a kernel daemon...

When this failure happens, I get the impression that lockd keeps on
trying to be "of service", retrying operations that are bound to
fail. So maybe the error should be cached, and then immediately
handled instead of making the client grind to a halt. (it is the (one
second?) timeout in nsm_mon_unmon and the big backlog of requests that
result in the same call and timeout that frustrate the client... )
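
The "cache the error" idea could look roughly like the sketch below. It is a user-space illustration only, not lockd code: struct statd_fail_cache, monitor_with_cache, the holdoff field and the statd_down stub are all hypothetical names, and the time is passed in explicitly to keep the example simple.

```c
#include <errno.h>

/* Hypothetical state: remembers the last failed attempt to reach rpc.statd. */
struct statd_fail_cache {
    int  last_error;  /* 0 when no failure is cached, else a negative errno */
    long failed_at;   /* time of the cached failure, in seconds */
    long holdoff;     /* how long to fail fast after a failure, in seconds */
};

/* Stand-in for the slow upcall; returns 0 or a negative errno. */
typedef int (*monitor_fn)(const char *host);

/* Counts how often the slow path was actually taken. */
static int monitor_calls;

/* Stub that behaves like an rpc.statd that is not running. */
int statd_down(const char *host)
{
    (void)host;
    monitor_calls++;
    return -ECONNREFUSED;
}

int monitor_with_cache(struct statd_fail_cache *c, long now,
                       const char *host, monitor_fn do_monitor)
{
    int err;

    /* Fail fast while a recent failure is still cached. */
    if (c->last_error && now - c->failed_at < c->holdoff)
        return c->last_error;

    /* Otherwise pay for one real attempt and remember the outcome. */
    err = do_monitor(host);
    if (err) {
        c->last_error = err;
        c->failed_at = now;
    } else {
        c->last_error = 0;
    }
    return err;
}
```

With a holdoff of a few seconds, a backlog of requests would share one timeout instead of each waiting for its own, while a restarted rpc.statd would still be picked up once the holdoff expires.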

Roger.


--
** [email protected] ** https://www.BitWizard.nl/ ** +31-15-2049110 **
** Delftechpark 11 2628 XJ Delft, The Netherlands. KVK: 27239233 **
f equals m times a. When your f is steady, and your m is going down
your a is going up. -- Chris Hadfield about flying up the space shuttle.


2021-04-27 19:35:26

by J. Bruce Fields

Subject: Re: Lockd error message is unclear.

On Tue, Apr 27, 2021 at 09:03:11PM +0200, Rogier Wolff wrote:
>
> Hi,
>
> Two things.....
>
> I got:
>
> lockd: cannot monitor <client>
>
> in the logfile and the client was terribly slow/not working at all.
>
> Everything pointed to a lockd problem...
>
> In the end... it turns out that my rpc.statd stopped working. I had
> to go and download the sources to figure this out... I would firstly
> suggest to improve the error message to give others running into this
> more hints as to where to look.
>
> The error message on line 169 of lockd.c could read:
>
> lockd: Error in the rpc to rpc.statd to monitor %s\n
>
> Would it be an idea to print the res.status error code?

I'm not sure about the wording, but including the error code sounds like
a good idea. (Would that have made a difference in your case?)

> That said...
>
> When this situation is going on, the client grinds to a halt, and
> lockd seems "stuck" in D state. I tried killing or stracing it, to try
> to clear the error, before I found out it is a kernel daemon...
>
> When this failure happens, I get the impression that lockd keeps on
> trying to be "of service", retrying operations that are bound to
> fail. So maybe the error should be cached, and then immediately
> handled instead of making the client grind to a halt. (it is the (one
> second?) timeout in nsm_mon_unmon and the big backlog of requests that
> result in the same call and timeout that frustrate the client... )

The -ECONNREFUSED case?

I'm not sure why it retries there. Maybe just to allow stopping and
starting rpc.statd (e.g. for upgrades) without failing operations?

--b.

2021-04-27 21:12:34

by Rogier Wolff

Subject: Re: Lockd error message is unclear.

On Tue, Apr 27, 2021 at 03:34:52PM -0400, J. Bruce Fields wrote:
> On Tue, Apr 27, 2021 at 09:03:11PM +0200, Rogier Wolff wrote:
> >
> > Hi,
> >
> > Two things.....
> >
> > I got:
> >
> > lockd: cannot monitor <client>
> >
> > in the logfile and the client was terribly slow/not working at all.
> >
> > Everything pointed to a lockd problem...
> >
> > In the end... it turns out that my rpc.statd stopped working. I had
> > to go and download the sources to figure this out... I would firstly
> > suggest to improve the error message to give others running into this
> > more hints as to where to look.
> >
> > The error message on line 169 of lockd.c could read:
> >
> > lockd: Error in the rpc to rpc.statd to monitor %s\n
> >
> > Would it be an idea to print the res.status error code?
>
> I'm not sure about the wording, but including the error code sounds like
> a good idea. (Would that have made a difference in your case?)

Not sure. Of course I was just "looking for a solution". So once I
figured out that rpc.statd was missing, I went looking for how that
came about.

But as it was, the prime suspect was "lockd is misbehaving". With a
better error message you can shift the blame away from your part of
the system. :-)

> > second?) timeout in nsm_mon_unmon and the big backlog of requests that
> > result in the same call and timeout that frustrate the client... )
>
> The -ECONNREFUSED case?
>
> I'm not sure why it retries there. Maybe just to allow stopping and
> starting rpc.statd (e.g. for upgrades) without failing operations?

Not sure IF it was retrying. Maybe not. But starting "google-chrome"
with 40 open tabs didn't progress to any tabs loading within the half
hour that I spent looking for why this was happening (unable to google
for a solution).... In the meantime it was constantly spewing the
error message, rate limited to 10 per minute....

Roger.
