Subject: ncpfs: Connection invalid / Input-/Output Errors

Hi,

First of all: I'm unsure whether I'm writing to the right list, so if I'm
wrong, please just correct me.

At one of our sites we run a Novell file server with some DOS clients
and a Linux server. The Linux server is running an older SuSE version
with a Linux 2.4.29 kernel, as well as various custom applications.
It has been running quite stably so far, without major problems.

As we want to migrate our servers to Debian, there is another system
running Debian, a Linux 2.6.12 kernel built from debianized sources, and
the same custom applications as on the SuSE system. But for a reason
we can't figure out, the Novell connection on that system fails at
random. It just "disappears", and the logfiles (syslog and kern.log)
state that the ncpfs connection is invalid. First we suspected a
hardware problem, but that does not seem to be the cause, as we swapped
the NIC in question and the problem keeps happening. Then we thought
it might be a kernel bug that is fixed in a newer version and
upgraded the kernel, but the situation did not change. I thought one
particular application might be the point of failure, but it runs on
the other host too, without any problem. Anyway, I straced the
application to see what happens when the connection breaks. Nothing
that could help: it's just normal operation until it gets into an
"Input/Output Error" loop.

At this point I don't know what to do. I don't see any way
to track down the problem, nor can I find any hints via Google or in
this mailing list, so I want to ask whether somebody can tell me how to
track down the problem, or give me some hints in any other way.

The ncpfs software running on the problematic server is 2.2.6, while the
server without problems is running 2.2.0.18.

Thanks in advance

Greets
Patrick Schönfeld

IN MEDIAS RES
-=Operations=-


2005-09-07 12:11:24

by Anton Altaparmakov

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

On Wed, 2005-09-07 at 13:08 +0200, schönfeld / in-medias-res wrote:
> At one of our sites we run a Novell file server with some DOS clients
> and a Linux server. The Linux server is running an older SuSE version
> with a Linux 2.4.29 kernel, as well as various custom applications.
> It has been running quite stably so far, without major problems.
>
> As we want to migrate our servers to Debian, there is another system
> running Debian, a Linux 2.6.12 kernel built from debianized sources, and
> the same custom applications as on the SuSE system. But for a reason
> we can't figure out, the Novell connection on that system fails at
> random. It just "disappears", and the logfiles (syslog and kern.log)
> state that the ncpfs connection is invalid. First we suspected a
> hardware problem, but that does not seem to be the cause, as we swapped
> the NIC in question and the problem keeps happening. Then we thought
> it might be a kernel bug that is fixed in a newer version and
> upgraded the kernel, but the situation did not change. I thought one
> particular application might be the point of failure, but it runs on
> the other host too, without any problem. Anyway, I straced the
> application to see what happens when the connection breaks. Nothing
> that could help: it's just normal operation until it gets into an
> "Input/Output Error" loop.
>
> At this point I don't know what to do. I don't see any way
> to track down the problem, nor can I find any hints via Google or in
> this mailing list, so I want to ask whether somebody can tell me how to
> track down the problem, or give me some hints in any other way.

Are you using IPX or TCP/IP or UDP? Are you using the same on both?
Are the two boxes in the same place and on the same connection/the same
speed? For example, if one box is sitting close to the NetWare server
and the other further away, on a congested network, it is much more
likely to lose the connection. Also, IPX is much worse than UDP. Our
connection loss problems decreased a lot when we moved from IPX to UDP.
We haven't had much experience with TCP/IP yet. Also, so far we have not
seen any connection loss problems since we switched from 2.4 to 2.6
kernels (SuSE 9.3, i.e. 2.6.11.4-21.9).

One of the reasons for a connection disappearing is that the NCP
sequence numbers on the NetWare server and the Linux client get out
of sync. When the NetWare server detects this, it shuts down the
connection. Linux can't reconnect, so you get exactly the errors you
see and the connection is gone. The fix is to umount and mount again
when this happens.
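
As a toy illustration of the failure mode described above (not the actual
NCP implementation): the class below assumes a single 8-bit counter on each
side and simply treats any mismatch as fatal. All names are invented for
this sketch, and how the counters drift apart in practice is not modelled.

```python
import errno


class ToyConnection:
    """Toy model only: one client mount talking to one server."""

    def __init__(self):
        self._reset()

    def _reset(self):
        self.client_seq = 0   # next sequence number the client will send
        self.server_seq = 0   # next sequence number the server expects
        self.valid = True     # False once the server has dropped the connection

    def request(self):
        if not self.valid:
            # No automatic reconnect: every later call fails the same way.
            raise OSError(errno.EIO, "connection invalid")
        if self.client_seq != self.server_seq:
            # The server notices the mismatch and shuts the connection down.
            self.valid = False
            raise OSError(errno.EIO, "connection invalid")
        # Assumed 8-bit counters that wrap around.
        self.client_seq = (self.client_seq + 1) % 256
        self.server_seq = (self.server_seq + 1) % 256

    def remount(self):
        """Stands in for umount followed by mount: start from scratch."""
        self._reset()
```

Once the two counters differ, every call fails with EIO until remount() is
called, which mirrors the umount/mount workaround.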

To see if this is your problem, insert some printk()s in the relevant
ncpfs code (where exactly depends on whether you are using IPX or TCP/UDP)
and see if they are triggered. We have been trying to track this down
for years and have failed so far... We were hoping the problems had
gone away with the 2.6 kernel, but if you are seeing them, maybe we will
start seeing them too once term starts and Linux is used more again... (We
only switched to 2.6 this summer.)

> The ncpfs software running on the problematic server is 2.2.6, while the
> server without problems is running 2.2.0.18.

That is irrelevant. Only the kernel driver version matters.

Hope this is useful.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

Hi,

thanks for your answer.

Anton Altaparmakov wrote:
> Are you using IPX or TCP/IP or UDP? Are you using the same on both?

Sorry, I forgot to point that out. We are using IPX. I don't think it'll
be that easy to switch to anything else :/

> Are the two boxes in the same place and on the same connection/the same
> speed? For example, if one box is sitting close to the NetWare server
> and the other further away, on a congested network, it is much more
> likely to lose the connection.

Both systems are local, so that's not the difference
between them.

> Also, IPX is much worse than UDP. Our connection loss
> problems decreased a lot when we moved from IPX to UDP.
> We haven't had much experience with TCP/IP yet. Also, so far we have not
> seen any connection loss problems since we switched from 2.4 to 2.6
> kernels (SuSE 9.3, i.e. 2.6.11.4-21.9).

Well, I can imagine that IPX is much worse than UDP ("IPX just sucks").
Unfortunately it doesn't seem to be that easy to switch that system
over to UDP, because the Novell server is at the center of a whole system
which has to be highly available, so we don't want to touch it.

> One of the reasons for a connection disappearing is that the NCP
> sequence numbers on the NetWare server and the Linux client get out
> of sync. When the NetWare server detects this, it shuts down the
> connection. Linux can't reconnect, so you get exactly the errors you
> see and the connection is gone. The fix is to umount and mount again
> when this happens.

Uhmm... then the question remains: why should that happen on the first
machine but not on the second?


> To see if this is your problem, insert some printk()s in the relevant
> ncpfs code (where exactly depends on whether you are using IPX or TCP/UDP)

Well, I'm using IPX. So where do I insert the printk()s? And what kind
of printk()s should I insert? Please don't think of me as an idiot,
but I'm just not familiar with "kernel hacking".

> Hope this is useful.

A little bit. Thanks anyway.

Greets
Patrick

2005-09-07 16:24:35

by Petr Vandrovec

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

schönfeld / in-medias-res wrote:
> Hi,
>
> thanks for your answer.

> Uhmm... then the question remains: why should that happen on the first
> machine but not on the second?

Enable the display of connection watchdog logouts on the server. Do not
use the 'intr' mount option. Do not send a KILL signal to a process
which is waiting for a reply from the server. If you are not sure that your
network infrastructure is fine, use the 'hard' mount option to disable
timeouts altogether.

>>To see if this is your problem, insert some printk()s in the relevant
>>ncpfs code (where exactly depends on whether you are using IPX or TCP/UDP)
>
> Well, I'm using IPX. So where do I insert the printk()s? And what kind
> of printk()s should I insert? Please don't think of me as an idiot,
> but I'm just not familiar with "kernel hacking".

Into 'ncp_invalidate_conn()', or better, into its callers. One is in
__abort_ncp_connection (invoked for IPX connections when
__ncpdgram_timeout_proc fires), the second is in ncp_do_request (if the server
reports some problem, or if a KILL signal is sent to the process).

Petr

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

Hi Petr,

Petr Vandrovec wrote:
> Enable the display of connection watchdog logouts on the server. Do not
> use the 'intr' mount option. Do not send a KILL signal to a process
> which is waiting for a reply from the server. If you are not sure that your
> network infrastructure is fine, use the 'hard' mount option to disable
> timeouts altogether.

Well, the thing with KILL signals is something I found after reading
your email; you wrote that to another person a while ago. Now I
found that I missed something when I looked for differences between the two
servers, and I have a suspicion. The only real difference between
the two servers is that the one with the problems runs a Nagios NRPE
server and some plugins, e.g. to check disk space on the Novell disk,
while the other server does not. Now I found that heavy operations on
the filesystem (e.g. stat'ing many small files in a short time) are
rather problematic if you want to do anything else on the filesystem
at the same time. The second process just hangs until the first one
accessing the ncp filesystem has finished its operation. Well, if
Nagios wants to run a check, it sends a request to the NRPE
server, which starts a plugin to perform the check.
Now the problem is that the plugin will not return a result before
the timeout (I'm quite sure one exists) expires. The only
question now is: does the NRPE server send a SIGKILL to the plugin when the
timeout expires? I'll test that. Maybe that is where the problem lies.
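
For reference, a minimal sketch of what "kill the plugin on timeout" looks
like from the supervising side. This is not NRPE's code; whether NRPE behaves
this way is exactly the open question here. The function name
run_with_timeout is made up for the example.

```python
import signal
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv; if it is still running after timeout_s seconds, SIGKILL it.

    Returns the child's return code (a negative signal number if it was
    killed by a signal).
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()          # on POSIX this sends SIGKILL
        return proc.wait()   # reap the child


if __name__ == "__main__":
    # A child that sleeps far longer than the timeout gets killed.
    rc = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 0.5
    )
    print("killed by SIGKILL:", rc == -signal.SIGKILL)
```

If the child happens to be blocked waiting for a reply on the ncpfs mount at
that moment, this is the situation warned about above.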

For now: thanks for your help. I'll try that first and then possibly
the printk thing.

> Into 'ncp_invalidate_conn()', or better, into its callers. One is in
> __abort_ncp_connection (invoked for IPX connections when
> __ncpdgram_timeout_proc fires), the second is in ncp_do_request (if the server
> reports some problem, or if a KILL signal is sent to the process).

Ok, thanks.

Greets
Patrick

2005-09-09 10:46:12

by Petr Vandrovec

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

schönfeld / in-medias-res wrote:
> Hi Petr,
>
> the two servers is that the one with the problems runs a Nagios NRPE
> server and some plugins, e.g. to check disk space on the Novell disk,
> while the other server does not. Now I found that heavy operations on
> the filesystem (e.g. stat'ing many small files in a short time) are
> rather problematic if you want to do anything else on the filesystem
> at the same time. The second process just hangs until the first one
> accessing the ncp filesystem has finished its operation. Well, if

You need either another CPU, or a semaphore which does not suffer from
starvation. Or you have to rewrite ncpfs to use some queue instead of a simple
semaphore. What happens is that your copy process, in a loop, acquires
ncp_server's semaphore, sends a request to the server, waits for the response,
and releases the semaphore. It does that for every request sent out. Now your
process comes in, finds that ncp_server's semaphore is locked, and starts
waiting. The other process gets an answer from the server, releases the
semaphore, and as both processes were just waiting before this happened, they
both have the same priority, and so the one which just did up() continues to
run. And before the woken-up process gets a chance to do its task, the copy
process sends another request, and so your second process goes to sleep again.
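
The sequence above can be sketched as a small, deterministic toy model. It
is not ncpfs code: it assumes a single CPU, one bulk process that already
holds the lock and one waiter, and the function name and the fifo_handoff
switch are invented for the illustration.

```python
def requests_before_waiter_runs(total_requests, fifo_handoff):
    """Count bulk requests completed before the waiter first owns the lock.

    Toy single-CPU model: the bulk process already holds the lock and the
    waiter is asleep on it.
    """
    completed = 0
    while completed < total_requests:
        completed += 1  # reply arrives, bulk process calls up()
        if fifo_handoff:
            # Queue-style lock: ownership goes straight to the oldest waiter.
            return completed
        # Plain semaphore: the waiter is only made runnable. The bulk
        # process keeps the CPU, takes the lock again for its next request
        # and then sleeps waiting for the server. Only now does the waiter
        # run, find the lock taken, and go back to sleep.
    return completed  # bulk process has finished, the lock is finally free


if __name__ == "__main__":
    print(requests_before_waiter_runs(1000, fifo_handoff=False))
    print(requests_before_waiter_runs(1000, fifo_handoff=True))
```

In this model the waiter gets the lock after a single request when ownership
is handed over in queue order, and only after the whole bulk job otherwise.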

Petr

Subject: Re: ncpfs: Connection invalid / Input-/Output Errors

Petr Vandrovec wrote:
> schönfeld / in-medias-res wrote:
>
>> Hi Petr,
>>
>> the two servers is that the one with the problems runs a Nagios NRPE
>> server and some plugins, e.g. to check disk space on the Novell disk,
>> while the other server does not. Now I found that heavy operations on
>> the filesystem (e.g. stat'ing many small files in a short time) are
>> rather problematic if you want to do anything else on the filesystem
>> at the same time. The second process just hangs until the first one
>> accessing the ncp filesystem has finished its operation. Well, if
>
>
> You need either another CPU, or a semaphore which does not suffer from
> starvation. Or you have to rewrite ncpfs to use some queue instead of a simple
> semaphore. What happens is that your copy process, in a loop, acquires
> ncp_server's semaphore, sends a request to the server, waits for the response,
> and releases the semaphore. It does that for every request sent out. Now your
> process comes in, finds that ncp_server's semaphore is locked, and starts
> waiting. The other process gets an answer from the server, releases the
> semaphore, and as both processes were just waiting before this happened, they
> both have the same priority, and so the one which just did up() continues to
> run. And before the woken-up process gets a chance to do its task, the copy
> process sends another request, and so your second process goes to sleep again.

Ah, thanks. That makes things a lot clearer.

I found out that my assumption was correct: the plugin really gets a KILL
signal if it exceeds the timeout. That means the Nagios check plugin is
the source of the problem (in combination with what you explained
AND the process which uses the ncpfs regularly and runs constantly).
Now we have found a solution for that: we just start the always-running
process with a lower priority. That makes ncpfs access possible
while this process is running and producing load. Now, if the
always-running process runs with low priority (nice +5) and the
Nagios plugin tries to do something on the ncpfs, it is able to, runs
fine and exits gracefully. Problem solved, at least until we find a
solution that does not look like a workaround ;-)
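
A minimal sketch of that workaround, assuming the always-running process is
launched from a small wrapper. The helper name run_niced is invented for the
example, and the command it starts here is only a stand-in that prints its
own niceness.

```python
import os
import subprocess
import sys


def run_niced(argv, increment=5):
    """Run argv with its niceness raised by `increment` (POSIX only)."""
    return subprocess.run(
        argv,
        # Runs in the child between fork and exec, so only the child
        # gets the lower priority.
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the real long-running job: report the child's niceness.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```

From a shell, starting the job with "nice -n 5" has the same effect.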

Thanks for your help! You helped me very much.

Bye
Patrick