2011-07-27 16:07:53

by J. Bruce Fields

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 09:54:09AM -0400, Justin Piszcz wrote:
> Hi,
>
> Kernel 2.6.30 on client.
> Kernel 2.6.28 on server.
>
> p34 kernel: [92223.918892] NFS: directory motion/cam2 contains a
> readdir loop. Please contact your server vendor. Offending cookie: 10272

What filesystem on the server are you exporting?

> In the past I used NFS to push -> imagery -> NFS server.
> Now I've flipped it so I am storing the images locally and viewing
> them remotely, what causes this?

Sorry, I don't understand what you mean.

--b.


2011-07-27 20:05:55

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 04:02:40PM -0400, Christoph Hellwig wrote:
> But looking closer at it it only prints the directory name and not that
> of any of the matching cookies, making it pretty useless to debug any
> problem. (and it makes my previous question to Justin look stupid..).
>
>
> But so far I still stick to my previous theory that this sounds like
> a directory offset getting reused. How is cache invalidation for
> the array supposed to work? And maybe more importantly, given that he
> can only reproduce it with a .38 client did any bugs get fixed in that
> code recently that might lead to issues with the cache invalidation?

Actually we won't even need cache invalidation bugs, see
nfsd_buffered_readdir() - we might do multiple vfs_readdir calls to
fill a single nfs reply, and between those calls the directory contents
might have been completely replaced. In the worst (pathological) case
you might get a second readdir having exactly the same offsets, but
pointing to completely different inodes.


2011-07-27 19:35:03

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, Christoph Hellwig wrote:

> Justin,
>
> can you please run the attached test program on the affected directory
> on the server, and see if you see duplicates in the d_off column. Unless
> you have privacy concerns I would also love to see the full output.
>
>

Hi,

Done:

atom:/d1/motion/cam1# /root/getdents > /tmp/cam1-out.txt
atom:/d1/motion/cam1# cd ../cam2
atom:/d1/motion/cam2# /root/getdents > /tmp/cam2-out.txt
atom:/d1/motion/cam2# cd ../cam3
atom:/d1/motion/cam3# /root/getdents > /tmp/cam3-out.txt
atom:/d1/motion/cam3#

Files:
http://home.comcast.net/~jpiszcz/20110727/cam1-out.txt
http://home.comcast.net/~jpiszcz/20110727/cam2-out.txt
http://home.comcast.net/~jpiszcz/20110727/cam3-out.txt

Currently I do not see any dupes, however I have a script that moves
images out of the directory once an hour:
0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1

I'll disable that for now and see if this recurs; if it does, I'll gather
additional output and send it out, thanks.

Justin.


2011-07-27 20:37:03

by Myklebust, Trond

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, 2011-07-27 at 15:47 -0400, Christoph Hellwig wrote:
> On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
> >
> >
> > On Wed, 27 Jul 2011, Christoph Hellwig wrote:
> >
> > >On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
> > >>Currently I do not see any dupes, however I have a script that moves
> > >>images out of the directory once an hour:
> > >>0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
> > >
> > >Do you keep adding files to the directory while you move files out?
> > Yes, otherwise there are too many files in the directory and viewers, e.g.,
> > each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
> > it around 5,000 pictures or less.
> >
> > >What's the rate of additions/removals to the directory?
> > Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
> >
> > atom:/d1/motion# find cam1|wc
> > 5215 5215 166853
> > atom:/d1/motion# find cam2|wc
> > 5069 5069 162181
> > atom:/d1/motion# find cam3|wc
> > 5594 5594 178981
> > atom:/d1/motion#
>
> This sounds a lot like xfs simply filling up the directory index slots
> of files that you just moved out with new files, and nfs falsely
> claiming that this is a problem.

Yep. There is an existing bugzilla report for this bug at

https://bugzilla.kernel.org/show_bug.cgi?id=38572

I have a preliminary patch there that attempts to turn off the loop
detection when the directory is seen to change, however that patch still
appears to have a bug in it, and I haven't had time to figure out what
is wrong yet.
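
The patch itself is not reproduced in this archive, so the following is only a toy sketch of the idea described above, not the code from the bugzilla entry. It assumes the client has some per-directory "version" value that changes whenever the directory changes; the names toy_detector, toy_detector_init and toy_detector_check are hypothetical and do not exist in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_SEEN 128

/* Hypothetical per-directory state for duplicate-cookie tracking. */
struct toy_detector {
    uint64_t dir_version;        /* directory version when tracking began */
    uint64_t seen[TOY_MAX_SEEN]; /* cookies seen under that version */
    size_t   nseen;
};

void toy_detector_init(struct toy_detector *t, uint64_t dir_version)
{
    t->dir_version = dir_version;
    t->nseen = 0;
}

/*
 * Returns 1 if `cookie` was already seen while the directory version
 * stayed the same (a genuine loop suspect), 0 otherwise. If the version
 * changed, earlier cookies are forgotten, because a reused cookie is
 * then expected rather than suspicious.
 */
int toy_detector_check(struct toy_detector *t, uint64_t dir_version,
                       uint64_t cookie)
{
    if (dir_version != t->dir_version) {
        t->dir_version = dir_version;
        t->nseen = 0;
    }
    for (size_t i = 0; i < t->nseen; i++)
        if (t->seen[i] == cookie)
            return 1;
    /* Toy limit: once the table is full, new cookies are not recorded. */
    if (t->nseen < TOY_MAX_SEEN)
        t->seen[t->nseen++] = cookie;
    return 0;
}
```

With this shape, a repeated cookie only counts as a loop if the directory did not change between the two sightings.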

Can you perhaps take a look, Bryan?

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-07-27 19:54:52

by Anna Schumaker

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On 07/27/2011 03:47 PM, Christoph Hellwig wrote:
> On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
>>
>>
>> On Wed, 27 Jul 2011, Christoph Hellwig wrote:
>>
>>> On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
>>>> Currently I do not see any dupes, however I have a script that moves
>>>> images out of the directory once an hour:
>>>> 0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
>>>
>>> Do you keep adding files to the directory while you move files out?
>> Yes, otherwise there are too many files in the directory and viewers, e.g.,
>> each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
>> it around 5,000 pictures or less.
>>
>>> What's the rate of additions/removals to the directory?
>> Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
>>
>> atom:/d1/motion# find cam1|wc
>> 5215 5215 166853
>> atom:/d1/motion# find cam2|wc
>> 5069 5069 162181
>> atom:/d1/motion# find cam3|wc
>> 5594 5594 178981
>> atom:/d1/motion#
>
> This sounds a lot like xfs simply filling up the directory index slots
> of files that you just moved out with new files, and nfs falsely
> claiming that this is a problem.
>
> Any chance to figure out if the file you hit the printk with was one
> that got either recently added or moved when you hit it? (I can't
> follow the nfs code enough to check if it prints the first or second hit
> of the same cookie)

It should be printing on the second hit of a cookie.

- Bryan



2011-07-27 16:35:06

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, J. Bruce Fields wrote:

> On Wed, Jul 27, 2011 at 09:54:09AM -0400, Justin Piszcz wrote:
>> Hi,
>>
>> Kernel 2.6.30 on client.
>> Kernel 2.6.28 on server.
>>
>> p34 kernel: [92223.918892] NFS: directory motion/cam2 contains a
>> readdir loop. Please contact your server vendor. Offending cookie: 10272
>
> What filesystem on the server are you exporting?

Hi,

xfs.
/dev/sda1 on / type xfs (rw,noatime)

Nothing special, thoughts?

Justin.


2011-07-27 20:27:08

by Rüdiger Meier

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wednesday 27 July 2011, Christoph Hellwig wrote:
> On Wed, Jul 27, 2011 at 03:54:49PM -0400, Bryan Schumaker wrote:
> > It should be printing on the second hit of a cookie.
>
> But looking closer at it it only prints the directory name and not
> that of any of the matching cookies, making it pretty useless to
> debug any problem. (and it makes my previous question to Justin look
> stupid..).
>
>
> But so far I still stick to my previous theory that this sounds like
> a directory offset getting reused. How is cache invalidation for
> the array supposed to work? And maybe more importantly, given that
> he can only reproduce it with a .38 client did any bugs get fixed in
> that code recently that might lead to issues with the cache
> invalidation?

At the time I started this thread
http://comments.gmane.org/gmane.linux.nfs/40863
I had the feeling that the readdir cache changes in 2.6.37 had
something to do with these loop problems.

After that thread I accepted that it's a general problem with
ext4/dirindex and nfs, but seeing it again on xfs with just 5000 files
I'm in doubt again.

cu,
Rudi

2011-07-27 21:21:57

by Rüdiger Meier

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wednesday 27 July 2011, Christoph Hellwig wrote:
> On Wed, Jul 27, 2011 at 10:26:55PM +0200, Rüdiger Meier wrote:
> > At the time I've started this thread
> > http://comments.gmane.org/gmane.linux.nfs/40863
> > I had the feeling that the readdir cache changings in 2.6.37 have
> > something to do with these loop problems.
> >
> > After that thread I've accepted that's a general problem with
> > ext4/dirindex and nfs but seeing it again on xfs with just 5000
> > files I'm in doubt again.
>
> Two separate issues. [...]

Yup, I didn't mean to say that I'm in doubt about the general
ext4/dirindex problem, but I'm still in doubt about the complete
innocence of the readdir cache.

I guess I ran into both issues at that time. I remember that I
couldn't easily create such a "broken" dir from scratch, but my users
managed to have dozens of them, often with just about 30000 files.
Somehow it seemed to be important that the dirs were growing in a
natural way.

However, no problems since then with xfs and ext4 without dirindex. But
I still have the feeling that upgrading to 2.6.37 was also part of the
problem.

cu,
Rudi


2011-07-27 19:47:27

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
>
>
> On Wed, 27 Jul 2011, Christoph Hellwig wrote:
>
> >On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
> >>Currently I do not see any dupes, however I have a script that moves
> >>images out of the directory once an hour:
> >>0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
> >
> >Do you keep adding files to the directory while you move files out?
> Yes, otherwise there are too many files in the directory and viewers, e.g.,
> each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
> it around 5,000 pictures or less.
>
> >What's the rate of additions/removals to the directory?
> Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
>
> atom:/d1/motion# find cam1|wc
> 5215 5215 166853
> atom:/d1/motion# find cam2|wc
> 5069 5069 162181
> atom:/d1/motion# find cam3|wc
> 5594 5594 178981
> atom:/d1/motion#

This sounds a lot like xfs simply filling up the directory index slots
of files that you just moved out with new files, and nfs falsely
claiming that this is a problem.

Any chance to figure out if the file you hit the printk with was one
that got either recently added or moved when you hit it? (I can't
follow the nfs code enough to check if it prints the first or second hit
of the same cookie)

2011-07-27 17:17:35

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, Ruediger Meier wrote:

> On Wednesday 27 July 2011, Bryan Schumaker wrote:
>> On 07/27/2011 12:28 PM, Justin Piszcz wrote:
>>> On Wed, 27 Jul 2011, J. Bruce Fields wrote:
>>>>
>>>> What filesystem on the server are you exporting?
>>>
>>> xfs.
>>> /dev/sda1 on / type xfs (rw,noatime)
>>>
>>> Nothing special, thoughts?
>>
>> Are there a lot of files in the directory you're exporting? It looks
>> like cookie 10272 is mapped to multiple files.
>
> I thought xfs is immune to readdir loops!?
> Is your export directory really located directly within / on /dev/sda1?

Hi,

I was sharing out a directory on the NFS server:
/d1 192.168.0.0/24(async,rw,no_root_squash,no_subtree_check,fsid=1)

Should I share out / instead?
Is this a known problem?

$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 13G 18G 43% /
tmpfs 2.0G 8.0K 2.0G 1% /lib/init/rw
udev 10M 192K 9.9M 2% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
$

Justin.



2011-07-27 17:09:41

by Anna Schumaker

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On 07/27/2011 01:00 PM, Ruediger Meier wrote:
> On Wednesday 27 July 2011, Bryan Schumaker wrote:
>> On 07/27/2011 12:28 PM, Justin Piszcz wrote:
>>> On Wed, 27 Jul 2011, J. Bruce Fields wrote:
>>>>
>>>> What filesystem on the server are you exporting?
>>>
>>> xfs.
>>> /dev/sda1 on / type xfs (rw,noatime)
>>>
>>> Nothing special, thoughts?
>>
>> Are there a lot of files in the directory you're exporting? It looks
>> like cookie 10272 is mapped to multiple files.
>
> I thought xfs is immune to readdir loops!?

I guess that depends on how it generates the cookie... I want to try out the ext4 patches that were posted earlier today. I'll double check xfs while I'm at it.

- Bryan

> Is your export directory really located directly within / on /dev/sda1?
>
> cu,
> Rudi


2011-07-27 17:45:42

by J. Bruce Fields

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 01:17:35PM -0400, Justin Piszcz wrote:
>
>
> On Wed, 27 Jul 2011, Ruediger Meier wrote:
>
> >On Wednesday 27 July 2011, Bryan Schumaker wrote:
> >>On 07/27/2011 12:28 PM, Justin Piszcz wrote:
> >>>On Wed, 27 Jul 2011, J. Bruce Fields wrote:
> >>>>
> >>>>What filesystem on the server are you exporting?
> >>>
> >>>xfs.
> >>>/dev/sda1 on / type xfs (rw,noatime)
> >>>
> >>>Nothing special, thoughts?
> >>
> >>Are there a lot of files in the directory you're exporting? It looks
> >>like cookie 10272 is mapped to multiple files.
> >
> >I thought xfs is immune to readdir loops!?
> >Is your export directory really located directly within / on /dev/sda1?
>
> Hi,
>
> I was sharing out a directory on the NFS server:
> /d1 192.168.0.0/24(async,rw,no_root_squash,no_subtree_check,fsid=1)
>
> Should I share out / instead?

You can do that if you want, but note that anyone malicious on that
network can get access to / by guessing filehandles. (Safer would be to
mount a separate partition at /d1.)

But in any case that's got nothing to do with readdir cookie problems.

--b.

> Is this a known problem?
>
> $ df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/sda1 30G 13G 18G 43% /
> tmpfs 2.0G 8.0K 2.0G 1% /lib/init/rw
> udev 10M 192K 9.9M 2% /dev
> tmpfs 2.0G 0 2.0G 0% /dev/shm
> $
>
> Justin.
>
>

2011-07-27 17:15:54

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, Bryan Schumaker wrote:

> On 07/27/2011 12:28 PM, Justin Piszcz wrote:
>>
>>
>> On Wed, 27 Jul 2011, J. Bruce Fields wrote:
>>
>>> On Wed, Jul 27, 2011 at 09:54:09AM -0400, Justin Piszcz wrote:
>>>> Hi,
>>>>
>>>> Kernel 2.6.30 on client.
>>>> Kernel 2.6.28 on server.
>>>>
>>>> p34 kernel: [92223.918892] NFS: directory motion/cam2 contains a
>>>> readdir loop. Please contact your server vendor. Offending cookie: 10272
>>>
>>> What filesystem on the server are you exporting?
>>
>> Hi,
>>
>> xfs.
>> /dev/sda1 on / type xfs (rw,noatime)
>>
>> Nothing special, thoughts?
>
> Are there a lot of files in the directory you're exporting? It looks like cookie 10272 is mapped to multiple files. When the client tries to resume reading from this cookie, xfs will reply from the first matching file and cause the client to enter a loop.

Should I be using a different filesystem?

user@atom:/d1$ cd /d1/motion/cam1
user@atom:/d1/motion/cam1$ ls|wc
5198 5198 140346
user@atom:/d1/motion/cam1$

Justin.



2011-07-27 17:00:14

by Rüdiger Meier

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wednesday 27 July 2011, Bryan Schumaker wrote:
> On 07/27/2011 12:28 PM, Justin Piszcz wrote:
> > On Wed, 27 Jul 2011, J. Bruce Fields wrote:
> >>
> >> What filesystem on the server are you exporting?
> >
> > xfs.
> > /dev/sda1 on / type xfs (rw,noatime)
> >
> > Nothing special, thoughts?
>
> Are there a lot of files in the directory you're exporting? It looks
> like cookie 10272 is mapped to multiple files.

I thought xfs is immune to readdir loops!?
Is your export directory really located directly within / on /dev/sda1?

cu,
Rudi

2011-07-27 20:47:30

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 10:26:55PM +0200, Rüdiger Meier wrote:
> At the time I've started this thread
> http://comments.gmane.org/gmane.linux.nfs/40863
> I had the feeling that the readdir cache changings in 2.6.37 have
> something to do with these loop problems.
>
> After that thread I've accepted that's a general problem with
> ext4/dirindex and nfs but seeing it again on xfs with just 5000 files
> I'm in doubt again.

Two separate issues. For one thing the nfs code simply doesn't seem
to handle changing directories very well, and as issue one and a half
the Linux NFS server might even send incoherent readdir output in a
single protocol reply.

Issue two is that the ext3/4 hashed directory format is too simple (not
to say dumb) to provide a proper 32-bit linear value for the dirent
d_off field. It's not a complex task, and the first relatively simple
generation of xfs btree directories couldn't handle it either. The
v2 directory format handles it fine, but at the cost of a much more
complex codebase.


2011-07-27 20:54:27

by Myklebust, Trond

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, 2011-07-27 at 16:37 -0400, Trond Myklebust wrote:
> On Wed, 2011-07-27 at 15:47 -0400, Christoph Hellwig wrote:
> > On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
> > >
> > >
> > > On Wed, 27 Jul 2011, Christoph Hellwig wrote:
> > >
> > > >On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
> > > >>Currently I do not see any dupes, however I have a script that moves
> > > >>images out of the directory once an hour:
> > > >>0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
> > > >
> > > >Do you keep adding files to the directory while you move files out?
> > > Yes, otherwise there are too many files in the directory and viewers, e.g.,
> > > each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
> > > it around 5,000 pictures or less.
> > >
> > > >What's the rate of additions/removals to the directory?
> > > Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
> > >
> > > atom:/d1/motion# find cam1|wc
> > > 5215 5215 166853
> > > atom:/d1/motion# find cam2|wc
> > > 5069 5069 162181
> > > atom:/d1/motion# find cam3|wc
> > > 5594 5594 178981
> > > atom:/d1/motion#
> >
> > This sounds a lot like xfs simply filling up the directory index slots
> > of files that you just moved out with new files, and nfs falsely
> > claiming that this is a problem.
>
> Yep. There is an existing bugzilla report for this bug at
>
> https://bugzilla.kernel.org/show_bug.cgi?id=38572
>
> I have a preliminary patch there that attempts to turn off the loop
> detection when the directory is seen to change, however that patch still
> appears to have a bug in it, and I haven't had time to figure out what
> is wrong yet.
>
> Can you perhaps take a look, Bryan?

Actually, Justin, can you test the following slight variant on the patch
in the bugzilla?

8<---------------------------------------------------------

2011-07-27 19:57:09

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, Christoph Hellwig wrote:

> On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
>>
>>
>> On Wed, 27 Jul 2011, Christoph Hellwig wrote:
>>
>>> On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
>>>> Currently I do not see any dupes, however I have a script that moves
>>>> images out of the directory once an hour:
>>>> 0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
>>>
>>> Do you keep adding files to the directory while you move files out?
>> Yes, otherwise there are too many files in the directory and viewers, e.g.,
>> each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
>> it around 5,000 pictures or less.
>>
>>> What's the rate of additions/removals to the directory?
>> Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
>>
>> atom:/d1/motion# find cam1|wc
>> 5215 5215 166853
>> atom:/d1/motion# find cam2|wc
>> 5069 5069 162181
>> atom:/d1/motion# find cam3|wc
>> 5594 5594 178981
>> atom:/d1/motion#
>
> This sounds a lot like xfs simply filling up the directory index slots
> of files that you just moved out with new files, and nfs falsely
> claiming that this is a problem.
>
> Any chance to figure out if the file you hit the printk with was one
> that got either recently added or moved when you hit it? (I can't
> follow the nfs code enough to check if it prints the first or second hit
> of the same cookie)
>

It seems to happen across all directories, these are from the past 24 hours.

[41901.041923] NFS: directory motion/cam2 contains a readdir loop. Please contact your server vendor. Offending cookie: 14368
[41901.275284] NFS: directory motion/cam3 contains a readdir loop. Please contact your server vendor. Offending cookie: 17435
[45497.265250] NFS: directory motion/cam1 contains a readdir loop. Please contact your server vendor. Offending cookie: 14488
[45498.832696] NFS: directory motion/cam1 contains a readdir loop. Please contact your server vendor. Offending cookie: 16416
[45507.812712] NFS: directory motion/cam2 contains a readdir loop. Please contact your server vendor. Offending cookie: 14778
[45508.458785] NFS: directory motion/cam2 contains a readdir loop. Please contact your server vendor. Offending cookie: 14778
[92223.918892] NFS: directory motion/cam2 contains a readdir loop. Please contact your server vendor. Offending cookie: 10272
[99413.259688] NFS: directory motion/cam1 contains a readdir loop. Please contact your server vendor. Offending cookie: 10272
[113791.004006] NFS: directory motion/cam1 contains a readdir loop. Please contact your server vendor. Offending cookie: 6848

Interestingly, I have two machines that perform this function, both XFS and it
only affects the client running 2.6.38:

$ df -h
2.6.38 - Has a kernel driver that was removed in 2.6.39 (rt2870sta) which
works really well.
atomw:/d1 30G 13G 18G 43% /nfs/atomw/d1

2.6.39:
d630w:/d1 75G 2.6G 72G 4% /nfs/d630w/d1

However, to rule out any kernel issues I'll try 3.0 and see if the problem recurs with a newer version as it is _NOT_ happening with 2.6.39 (similar setup) on both; however:

d630 => 32bit installation (core2duo t7500)
atomw => 64-bit atom

Justin.


2011-07-27 20:02:50

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 03:54:49PM -0400, Bryan Schumaker wrote:
> > Any chance to figure out if the file you hit the printk with was one
> > that got either recently added or moved when you hit it? (I can't
> > follow the nfs code enough to check if it prints the first or second hit
> > of the same cookie)
>
> It should be printing on the second hit of a cookie.

But looking closer at it, it only prints the directory name and not that
of any of the matching cookies, making it pretty useless to debug any
problem (and it makes my previous question to Justin look stupid...).


But so far I still stick to my previous theory that this sounds like
a directory offset getting reused. How is cache invalidation for
the array supposed to work? And maybe more importantly, given that he
can only reproduce it with a .38 client, did any bugs get fixed in that
code recently that might lead to issues with the cache invalidation?


2011-07-27 16:40:17

by Anna Schumaker

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On 07/27/2011 12:28 PM, Justin Piszcz wrote:
>
>
> On Wed, 27 Jul 2011, J. Bruce Fields wrote:
>
>> On Wed, Jul 27, 2011 at 09:54:09AM -0400, Justin Piszcz wrote:
>>> Hi,
>>>
>>> Kernel 2.6.30 on client.
>>> Kernel 2.6.28 on server.
>>>
>>> p34 kernel: [92223.918892] NFS: directory motion/cam2 contains a
>>> readdir loop. Please contact your server vendor. Offending cookie: 10272
>>
>> What filesystem on the server are you exporting?
>
> Hi,
>
> xfs.
> /dev/sda1 on / type xfs (rw,noatime)
>
> Nothing special, thoughts?

Are there a lot of files in the directory you're exporting? It looks like cookie 10272 is mapped to multiple files. When the client tries to resume reading from this cookie, xfs will reply from the first matching file and cause the client to enter a loop.

- Bryan
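
To illustrate the loop described above, here is a toy model in C. It is not the kernel's readdir code; toy_entry, toy_resume and toy_find_loop are hypothetical names, cookie 0 is assumed to mean "start of directory", and all real cookies are assumed to be nonzero. The "server" resumes after the first entry whose cookie matches, and the "client" reports the first cookie it sees a second time.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_entry {
    uint64_t    cookie; /* resume position handed to the client */
    const char *name;
};

/* Server side: index of the entry following the FIRST cookie match. */
size_t toy_resume(const struct toy_entry *e, size_t n, uint64_t cookie)
{
    if (cookie == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (e[i].cookie == cookie)
            return i + 1;
    return n;
}

/*
 * Client side: read `batch` entries per request, resuming from the last
 * cookie each time. Returns 1 and stores the repeated cookie in
 * *offending on the second hit of a cookie, 0 if the listing completes,
 * -1 on bad arguments or allocation failure.
 */
int toy_find_loop(const struct toy_entry *e, size_t n, size_t batch,
                  uint64_t *offending)
{
    if (batch == 0 || offending == NULL)
        return -1;
    uint64_t *seen = calloc(n ? n : 1, sizeof(*seen));
    if (seen == NULL)
        return -1;

    size_t nseen = 0;
    uint64_t cookie = 0;
    for (;;) {
        size_t i = toy_resume(e, n, cookie);
        if (i >= n)
            break; /* reached the end of the directory */
        for (size_t k = 0; k < batch && i < n; k++, i++) {
            cookie = e[i].cookie;
            for (size_t s = 0; s < nseen; s++) {
                if (seen[s] == cookie) {
                    *offending = cookie;
                    free(seen);
                    return 1;
                }
            }
            seen[nseen++] = cookie; /* at most n distinct cookies */
        }
    }
    free(seen);
    return 0;
}
```

Without the check for a second hit, a client resuming from a duplicated cookie in this model would keep receiving the same entries again and never reach the end of the listing.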

>
> Justin.
>


2011-07-27 18:11:17

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

Justin,

can you please run the attached test program on the affected directory
on the server, and see if you see duplicates in the d_off column. Unless
you have privacy concerns I would also love to see the full output.


Attachments:
(No filename) (222.00 B)
getdents.c (1.12 kB)
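
The attached getdents.c is not reproduced in this archive. As a rough idea of what such a test might look like, here is a minimal sketch, assuming Linux and the getdents64 system call; it is not the attached program, and the function name dump_dents and the output layout are made up for illustration. The record layout has to be declared by hand because the C library does not export it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records returned by the getdents64 system call. */
struct linux_dirent64 {
    uint64_t       d_ino;    /* inode number */
    int64_t        d_off;    /* offset (cookie) of the next entry */
    unsigned short d_reclen; /* length of this record */
    unsigned char  d_type;   /* file type */
    char           d_name[]; /* null-terminated name */
};

/*
 * Print one line per directory entry (d_off, d_ino, d_name) to `out`.
 * Returns the number of entries seen, or -1 on error.
 */
long dump_dents(const char *path, FILE *out)
{
    char buf[32768] __attribute__((aligned(8)));
    long count = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;
    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break; /* end of directory */
        for (long pos = 0; pos < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            fprintf(out, "%20lld %12llu %s\n", (long long)d->d_off,
                    (unsigned long long)d->d_ino, d->d_name);
            pos += d->d_reclen;
            count++;
        }
    }
    close(fd);
    return count;
}
```

Called from a small main() on the affected directory, the first output column could then be checked for repeats, for example with awk '{print $1}' | sort | uniq -d.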

2011-07-27 19:39:39

by Christoph Hellwig

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
> Currently I do not see any dupes, however I have a script that moves
> images out of the directory once an hour:
> 0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1

Do you keep adding files to the directory while you move files out?
What's the rate of additions/removals to the directory?

If we add files to the directory while removing others we could easily
re-use the same offset for a different file.
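
A toy model of that offset reuse, for illustration only: it does not reflect the real XFS directory format. It simply assumes a directory made of numbered slots where the slot index serves as the entry's offset and a new file takes the lowest free slot. The names toy_dir, toy_dir_init, toy_dir_add and toy_dir_remove are hypothetical.

```c
#include <string.h>

#define TOY_SLOTS   8
#define TOY_NAMELEN 32

/* Slot index doubles as the entry's offset; an empty name = free slot. */
struct toy_dir {
    char names[TOY_SLOTS][TOY_NAMELEN];
};

void toy_dir_init(struct toy_dir *d)
{
    memset(d, 0, sizeof(*d));
}

/* Store `name` in the lowest free slot; returns its offset, -1 if full. */
int toy_dir_add(struct toy_dir *d, const char *name)
{
    if (name[0] == '\0')
        return -1;
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (d->names[i][0] == '\0') {
            strncpy(d->names[i], name, TOY_NAMELEN - 1);
            d->names[i][TOY_NAMELEN - 1] = '\0';
            return i;
        }
    }
    return -1;
}

/* Free the slot holding `name`; returns its offset, -1 if not found. */
int toy_dir_remove(struct toy_dir *d, const char *name)
{
    if (name[0] == '\0')
        return -1;
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (strcmp(d->names[i], name) == 0) {
            d->names[i][0] = '\0';
            return i;
        }
    }
    return -1;
}
```

In this model, moving one image out and adding a new one gives the new file the offset the old file had, so a reader that remembered that offset now finds a different file there.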


2011-07-27 18:28:48

by Anna Schumaker

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On 07/27/2011 01:00 PM, Ruediger Meier wrote:
> On Wednesday 27 July 2011, Bryan Schumaker wrote:
>> On 07/27/2011 12:28 PM, Justin Piszcz wrote:
>>> On Wed, 27 Jul 2011, J. Bruce Fields wrote:
>>>>
>>>> What filesystem on the server are you exporting?
>>>
>>> xfs.
>>> /dev/sda1 on / type xfs (rw,noatime)
>>>
>>> Nothing special, thoughts?
>>
>> Are there a lot of files in the directory you're exporting? It looks
>> like cookie 10272 is mapped to multiple files.
>
> I thought xfs is immune to readdir loops!?

I can ls a directory with 500,000 files over nfs4. That's usually enough to cause the readdir loop in ext4, so I guess this is a different problem.

> Is your export directory really located directly within / on /dev/sda1?
>
> cu,
> Rudi


2011-07-27 19:44:21

by Justin Piszcz

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop



On Wed, 27 Jul 2011, Christoph Hellwig wrote:

> On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
>> Currently I do not see any dupes, however I have a script that moves
>> images out of the directory once an hour:
>> 0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
>
> Do you keep adding files to the directory while you move files out?
Yes, otherwise there are too many files in the directory and viewers, e.g.,
each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
it around 5,000 pictures or less.

> What's the rate of additions/removals to the directory?
Additions it depends, around 5,000 over a 12hr period, 416/hr, current:

atom:/d1/motion# find cam1|wc
5215 5215 166853
atom:/d1/motion# find cam2|wc
5069 5069 162181
atom:/d1/motion# find cam3|wc
5594 5594 178981
atom:/d1/motion#

>
> If we add files to the directory while removing others we could easily
> re-use the same offset for a different file.
>

Justin.


2011-07-27 20:56:36

by Myklebust, Trond

Subject: Re: 2.6.xx: NFS: directory motion/cam2 contains a readdir loop

On Wed, 2011-07-27 at 16:54 -0400, Trond Myklebust wrote:
> On Wed, 2011-07-27 at 16:37 -0400, Trond Myklebust wrote:
> > On Wed, 2011-07-27 at 15:47 -0400, Christoph Hellwig wrote:
> > > On Wed, Jul 27, 2011 at 03:44:20PM -0400, Justin Piszcz wrote:
> > > >
> > > >
> > > > On Wed, 27 Jul 2011, Christoph Hellwig wrote:
> > > >
> > > > >On Wed, Jul 27, 2011 at 03:35:01PM -0400, Justin Piszcz wrote:
> > > > >>Currently I do not see any dupes, however I have a script that moves
> > > > >>images out of the directory once an hour:
> > > > >>0 * * * * /usr/local/bin/move_to_old2.sh > /dev/null 2>&1
> > > > >
> > > > >Do you keep adding files to the directory while you move files out?
> > > > Yes, otherwise there are too many files in the directory and viewers, e.g.,
> > > > each geeqie (picture viewer) will use > 4-6GB of memory, so I try to keep
> > > > it around 5,000 pictures or less.
> > > >
> > > > >What's the rate of additions/removals to the directory?
> > > > Additions it depends, around 5,000 over a 12hr period, 416/hr, current:
> > > >
> > > > atom:/d1/motion# find cam1|wc
> > > > 5215 5215 166853
> > > > atom:/d1/motion# find cam2|wc
> > > > 5069 5069 162181
> > > > atom:/d1/motion# find cam3|wc
> > > > 5594 5594 178981
> > > > atom:/d1/motion#
> > >
> > > This sounds a lot like xfs simply filling up the directory index slots
> > > of files that you just moved out with new files, and nfs falsely
> > > claiming that this is a problem.
> >
> > Yep. There is an existing bugzilla report for this bug at
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=38572
> >
> > I have a preliminary patch there that attempts to turn off the loop
> > detection when the directory is seen to change, however that patch still
> > appears to have a bug in it, and I haven't had time to figure out what
> > is wrong yet.
> >
> > Can you perhaps take a look, Bryan?
>
> Actually, Justin, can you test the following slight variant on the patch
> in the bugzilla?

Doh! This one will actually compile....

> 8<---------------------------------------------------------