2015-10-21 15:35:24

by Richard Kojedzinszky

Subject: nfs lockup

Dear devs,

We have an NFS lockup issue. We run a Ganeti cluster consisting of 7
Debian Linux nodes and 1 FreeNAS server that hosts the VM images. The
images are exported via NFSv3. The problem is that, at random, one of our
nodes ends up in a livelock.

By that I mean the NFS share is still alive: we can list directories and
files, and can even read files (very slowly, see below). We can even write
to files, but the close operation on the file never returns; it blocks.

The read is slow in the sense that, while copying a file from the share to
/tmp, the data arrives at the node very fast, but the copy in /tmp grows
slowly.

I've also opened a Debian bug report for it, but I don't think it is
specific to Debian (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=801924).

The only way to recover is to reboot the machine, which interrupts all the
VMs running on it.

I've captured the stack trace of every task (attached); hopefully it helps
someone track down the issue.
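
One way such per-task traces could be gathered is sketched below. The thread
does not say how all.trace.txt was produced, so this is an assumption, not
the method actually used. The Python sketch reads /proc/<pid>/stat and
/proc/<pid>/stack (root is normally required, and the kernel must expose the
stack file) and keeps only tasks in uninterruptible sleep (state D), which is
where a blocked close() would typically show up. The name
blocked_task_stacks is hypothetical.

```python
import os
import sys


def blocked_task_stacks(proc_root="/proc"):
    """Return {pid: (comm, stack_text)} for tasks in state D.

    Only thread-group leaders are inspected; individual threads live
    under /proc/<pid>/task/<tid>/ and are not walked here.
    """
    result = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return result
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "net"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # comm is wrapped in parentheses and may itself contain spaces
            # or parentheses, so split on the first "(" and the last ")".
            comm = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 2:].split()[0]
            if state != "D":
                continue
            with open(os.path.join(base, "stack")) as f:
                stack = f.read()
        except (OSError, ValueError, IndexError):
            continue  # task exited, permission denied, or odd stat line
        result[int(entry)] = (comm, stack)
    return result


if __name__ == "__main__":
    # Optional argument: alternative proc root (assumption, for testing).
    root = sys.argv[1] if len(sys.argv) > 1 else "/proc"
    for pid, (comm, stack) in sorted(blocked_task_stacks(root).items()):
        print(f"== {pid} {comm}")
        print(stack)
```

Running this a few times during a hang and comparing the output would show
whether the same tasks stay blocked at the same place.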

Meanwhile, the other 6 nodes can access the NFS share without problems, so
I think this is not a networking or server issue. Restarting the NFS server
on the server side has no effect either; the node does not recover. The NFS
TCP connection is established and listing files works again, but writes do
not.

Some information about the nodes:
# uname -a
Linux host 3.16.0-4-amd64 #1 SMP Debian
3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux

They have 1.5G of RAM allocated to dom0, which should be enough.

I know this is not much information; please advise what to look for next
time. Unfortunately I don't know how to reproduce it.

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.


Attachments:
all.trace.txt.gz (31.79 kB)

2015-10-21 19:05:27

by Benjamin Coddington

Subject: Re: nfs lockup

I took a look at your Debian bug report. What's up with those DRBD procs?
Are you writing to DRBD-backed devices, and have you made sure that's not
involved in any way?

Ben

2015-10-21 20:09:21

by Richard Kojedzinszky

Subject: Re: nfs lockup


No, the lockup has nothing to do with DRBD. In the Ganeti cluster some VMs
use DRBD-mirrored disks, while others use images in a shared folder on NFS.
It is the latter that locks up sometimes. The DRBD devices work well, and
all network connectivity works well.

Please advise what to check next time. Unfortunately I cannot reproduce the
problem.

Could the MTU setting of 9000 affect NFS somehow? Does it matter that we
are using Xen, so a hypervisor is involved (for DRBD it does matter)?

Thanks,


Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.


2015-10-22 11:18:00

by Benjamin Coddington

Subject: Re: nfs lockup

It looks like a lot of processes are waiting on i_mutex in
generic_file_write_iter(). Is it possible you're hitting a particularly
bad spot of contention for that mutex?

You might use the 'perf top' tool to dig into what the system is doing
when this happens.
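
To put a number on "a lot of processes", a saved trace dump could be tallied
by function name. The sketch below assumes frames look like
"[<address>] symbol+0xOFF/0xLEN", a common layout for kernel stack dumps;
the actual layout of all.trace.txt is not shown in this thread, so the
pattern may need adjusting. count_frames and open_text are hypothetical
helper names.

```python
import gzip
import re
import sys
from collections import Counter

# Matches e.g. "[<ffffffff8150e8a9>] schedule+0x29/0x70".
# Frames marked unreliable ("[<...>] ? symbol+0x...") do not match,
# because the symbol must follow the bracket directly.
FRAME_RE = re.compile(r"\[<[0-9a-fA-F]+>\]\s+([A-Za-z_][\w.]*)\+0x")


def count_frames(lines):
    """Count how often each symbol appears as a stack frame."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def open_text(path):
    """Open a plain or gzip-compressed trace file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open_text(sys.argv[1]) as fh:
        for name, n in count_frames(fh).most_common(15):
            print(f"{n:6d}  {name}")
```

A function that tops this list alongside mutex_lock-style frames would
support the contention theory; it would not by itself identify the holder of
the mutex.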


2015-10-23 18:10:05

by J. Bruce Fields

Subject: Re: nfs lockup

On Wed, Oct 21, 2015 at 05:25:53PM +0200, Richard Kojedzinszky wrote:
> Dear devs,
>
> We have an nfs lockup issue. We run a ganeti cluster consisting of 7
> debian linux nodes and 1 freenas for hosting the vm images. The
> images are exported via nfsv3. The problem is that randomly we end
> in a livelock on one of our nodes.
>
> That means the nfs share is alive, we can list directories, files,
> even can read files (very slow, see later). And even can write to
> files, but the file close operation does not return, it gets
> blocked.
>
> The read is slow in that way that while copying a file from the
> share to /tmp, the data arrives very fast to the node, but in /tmp
> it accumulates slowly.

I don't understand what you mean by that. Do you have some measurements
to help quantify "very fast" and "slowly"?

--b.



2015-10-26 07:38:57

by Richard Kojedzinszky

Subject: Re: nfs lockup


I don't have exact measurements, but from what I observed the file grew at
around a few hundred kB/s, whereas after a reboot the same file can be
copied at a few MB/s.
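
To turn an observation like this into a measurement, the size of the
destination file can be sampled while the copy runs. A minimal sketch,
assuming the copy is started separately (for example with cp) and the
destination path is passed as the first argument; sample_size and rates are
hypothetical names.

```python
import os
import sys
import time


def rates(samples):
    """Convert [(timestamp, size_bytes), ...] into bytes/second per interval."""
    out = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt > 0:
            out.append((s1 - s0) / dt)
    return out


def sample_size(path, interval=1.0, count=10):
    """Record the size of `path` `count` times, `interval` seconds apart."""
    samples = []
    for i in range(count):
        if i:
            time.sleep(interval)
        samples.append((time.monotonic(), os.path.getsize(path)))
    return samples


if __name__ == "__main__" and len(sys.argv) > 1:
    for rate in rates(sample_size(sys.argv[1])):
        print(f"{rate / 1024:.1f} KiB/s")
```

Comparing these figures with the receive counters of the network interface
over the same period would quantify the gap between data arriving at the
node and data landing in the file.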

I have now upgraded the kernel to 4.2, and I am trying to collect more
information for when the hang occurs. Unfortunately I don't know exactly
what triggers the hang, so I cannot reproduce it. Measurements taken before
the hangs don't show anything unusual to me.

Thanks in advance,
Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.
