2004-11-15 20:03:18

by Robin Holt

Subject: 21 million inodes is causing severe pauses.


The subject line is a little misleading. That number comes from using
XFS on a 2.4 kernel. With a 2.6 kernel, we see problems similar to the
ones we are experiencing on 2.4, only less severe.

Digging into this some more, we determined that the problem is the large
number of inodes and dentry items held. For a machine with 32GB of memory
and 8 CPUs doing build-type activity, we have found that the count
stabilizes at between 2 and 8 million entries.
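
For anyone wanting to watch those counts, the usual counters are
/proc/sys/fs/inode-nr and /proc/sys/fs/dentry-state (per-cache detail is
in /proc/slabinfo). A trivial reader, assuming the standard 2.4/2.6
procfs layout (inode-nr: nr_inodes nr_free_inodes; dentry-state starts
with nr_dentry nr_unused):

#include <stdio.h>

/* Print the first line of a procfs counter file. */
static void dump(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(line, sizeof(line), f))
		printf("%s: %s", path, line);
	fclose(f);
}

int main(void)
{
	dump("/proc/sys/fs/inode-nr");
	dump("/proc/sys/fs/dentry-state");
	return 0;
}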

One significant problem we are running into is autofs trying to umount the
file systems. This results in the umount grabbing the BKL and the
inode_lock and holding them while it scans through the inode_list and the
other inode lists looking for inodes used by this super block and
attempting to free them.
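
For reference, this is roughly the shape of the 2.6-era invalidate_inodes()
as I understand it (a simplified sketch from the mainline source, not the
exact SLES9 code): everything runs under the global inode_lock, with the
BKL held by the caller, so a multi-million entry cache means a multi-second
stall for anyone else who needs either lock.

int invalidate_inodes(struct super_block *sb)
{
	int busy;
	LIST_HEAD(throw_away);

	spin_lock(&inode_lock);
	/* walk every global/per-sb list looking for this sb's inodes */
	busy  = invalidate_list(&inode_in_use, sb, &throw_away);
	busy |= invalidate_list(&inode_unused, sb, &throw_away);
	busy |= invalidate_list(&sb->s_dirty, sb, &throw_away);
	busy |= invalidate_list(&sb->s_io, sb, &throw_away);
	spin_unlock(&inode_lock);

	dispose_list(&throw_away);	/* actually frees the inodes */

	return busy;
}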

We patched a SLES9 kernel with the patch found in the -mm tree which
attempts to address this problem by linking inodes off the sb structure.
This does make the umount somewhat quicker, but on a busy NFS-mounted
filesystem the BKL and inode_lock still get in the way, causing frequent
system pauses on the order of seconds. This is on a SLES9 kernel which we
just put into a test production environment last Thursday. By 8:00 AM
Friday, the system was unusable.
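
For context, the idea of that patch is to keep every inode on a
per-superblock list so that umount only has to walk its own inodes.
Roughly like the following (the field names s_inodes and i_sb_list are the
ones the later mainline version uses; the backport we applied may differ):

	struct inode *inode, *next;

	spin_lock(&inode_lock);
	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		if (!atomic_read(&inode->i_count)) {
			/* unhash it, move it to a dispose list, and
			 * free it outside inode_lock */
		}
	}
	spin_unlock(&inode_lock);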

Additionally, we experience NULL pointer dereferences during
remove_inode_buffers. I have not looked for additional patches in the
-mm tree to address that problem.

While discussing this in the hallway, we have come up with a few possible
alternatives.

1) Have the dentry and inode cache sizes limited on a per-sb basis,
with a mount option as an override for the default setting (a rough
sketch of what this might look like appears after this list).

2) Have the vfs limit dentry and inode cache sizes based on
slab usage (i.e., nfs, ext2, and xfs slab sizes are limited independently
of each other).

3) Have the vfs limit it based on total inode_list entries.
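
To make (1) a little more concrete, here is a purely hypothetical sketch.
None of these names exist in any kernel; this is only the shape a
mount-option ceiling might take:

/* Hypothetical only: s_max_cached_inodes would be set from a mount
 * option (0 == unlimited, the default) and s_nr_cached_inodes would be
 * maintained where inodes for this sb are allocated and freed. */
struct sb_cache_limit {
	unsigned long	s_max_cached_inodes;
	atomic_t	s_nr_cached_inodes;
};

/* When this returns true, the allocation path would prune some of this
 * sb's unused inodes/dentries before adding another one. */
static inline int sb_inodes_over_limit(struct sb_cache_limit *l)
{
	return l->s_max_cached_inodes &&
	       atomic_read(&l->s_nr_cached_inodes) >= l->s_max_cached_inodes;
}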

We are not sure which, if any, is the right direction to go at this time.
We are only hoping to start a discussion. Any guidance would be
appreciated.

Thank you,
Robin Holt

PS: The patch referred to above is:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109474397830096&w=2


2004-11-15 20:41:09

by Norbert van Nobelen

Subject: Re: 21 million inodes is causing severe pauses.

It would help a lot if you provided logins to everybody on this list to
this desktop system of yours (-:

Anyway: there is a temporary solution which will help a little in this
case: increase the blocksize so that your total number of inodes decreases.
Since XFS is your own filesystem, you could calculate the results of this
pretty quickly.

Using another filesystem could help too, but I don't have any basis for
comparison since the largest system we have here at this moment only has
1 TB of diskspace.
I will monitor this thread, though; we are tendering for a somewhat larger
project in which problems like this could become our problem too.

On Monday 15 November 2004 20:55, you wrote:
> The subject line is a little deceiving. That number comes from using
> XFS on a 2.4 kernel. With a 2.6 kernel, we see problems similar to the
> ones we are experiencing on 2.4, only less severe.
>
> Digging into this some more, we determined the problem is the large number
> of inodes and dentry items held. For a machine with 32GB of memory and
> 8 cpus doing build type activity, we have found it stabilizes at between
> 2 and 8 million entries.
>
> One significant problem we are running into is autofs trying to umount the
> file systems. This results in the umount grabbing the BKL and inode_lock,
> holding it while it scans through the inode_list and others looking for
> inodes used by this super block and attempting to free them.
>
> We patched a SLES9 kernel with the patch found in the -mm tree which
> attempts to address this problem by linking inodes off the sb structure.
> This does make the umount somewhat quicker, but on a busy nfs mounted
> filesystem, the BKL and inode_lock do still get in the way causing
> frequent system pauses on the order of seconds. This is on a SLES9
> kernel which we just put into a test production environment last Thursday.
> By 8:00 AM Friday, the system was unusable.
>
> Additionally, we experience NULL pointer dereferences during
> remove_inode_buffers. I have not looked for additional patches in the
> -mm tree to address that problem.
>
> While discussing this in the hallway, we have come up with a few possible
> alternatives.
>
> 1) Have the dentry and inode sizes limited on a per sb basis
> with a mount option as an override for the default setting.
>
> 2) Have the vfs limit dentry and inode cache sizes based on
> slab usage (ie, nfs, ext2, and xfs slab sizes are limited independently
> of each other.
>
> 3) Have the vfs limit it based on total inode_list entries.
>
> We are not sure which if any is the right direction to go at this time.
> We are only hoping to start a discussion. Any guidance would be
> appreciated.
>
> Thank you,
> Robin Holt
>
> PS: The patch referred to above is:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109474397830096&w=2

2004-11-15 20:54:22

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Mon, Nov 15, 2004 at 09:35:35PM +0100, Norbert van Nobelen wrote:
> It would help a lot if you provided logins to everybody on this list to this
> desktop system of you (-:

That would not be possible. The setup is not that unique. It is basically
a SLES9 system with four 300GB filesystems locally mounted and a large
number of autofs-mounted home directories, etc. I can provide diagnostic
counters if they would be helpful.

>
> Anyway: There is temporary solution, which will help a little in this case:
> Increase the blocksize so that your total number of inodes decreases. Since
> XFS is your own filesystem, you could calculate the results of this pretty
> quick.

The problem is not isolated to XFS. Even with smaller blocksizes, we
still have a large number of files being referenced which will result
in many inodes. We saw the same behavior with ext2/3 and NFS. I don't
think it has anything to do with XFS.

>
> Using another filesystem could help too, but I don't have any comparison bases
> for this since the largest system we have here at this moment only has 1 TB
> of diskspace.

You should have enough disk space. You will need a lot of memory as
well. You will also need a workload which puts some memory pressure on
periodically, but not that much pressure and not that often. This appears
to make the vm favor the buffer cache over the slab cache and allows the
inode cache to grow really large.

Thanks,
Robin Holt

2004-11-15 21:01:44

by linux-os

Subject: Re: 21 million inodes is causing severe pauses.


Another temporary fix is to do:

while true ; do sleep 5 ; sync ; done

... or some 'C' code equivalent to force most of the stuff to
disk before it takes so much time that it's obvious to the
users.
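
For example, the 'C' equivalent could be as simple as this (a sketch, not
something you would deploy as-is):

#include <unistd.h>

/* Same idea as the shell loop above: sync every five seconds so dirty
 * data never piles up long enough for the flush to become noticeable. */
int main(void)
{
	for (;;) {
		sleep(5);
		sync();
	}
}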

If you have soooo much data buffered, it is going to take a
verrrry long time to write it to disk. So just write it before
you have so much buffered!

NULL pointer problems shouldn't happen. However, you don't say
if it's a kernel crash problem or a user-mode problem. If it's
a user-mode problem, the possibility exists that somebody isn't
properly checking the return value of read/write, etc. If EIO
(from attempting to modify an inode) was returned in errno, you
get -1 in the return value, and if that's used as an index into the
next bunch of data, you are dorked.
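
As an illustration of that failure mode (hypothetical user code, not taken
from any real application):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* read() returns -1 on error (e.g. errno == EIO); using that value
 * unchecked as an index walks off the front of the buffer. */
ssize_t read_checked(int fd, char *buf, size_t len)
{
	ssize_t n = read(fd, buf, len - 1);

	if (n < 0) {
		perror("read");
		return -1;	/* caller must not index with this */
	}
	buf[n] = '\0';		/* safe only because n was checked first */
	return n;
}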


On Mon, 15 Nov 2004, Norbert van Nobelen wrote:

> It would help a lot if you provided logins to everybody on this list to this
> desktop system of you (-:
>
> Anyway: There is temporary solution, which will help a little in this case:
> Increase the blocksize so that your total number of inodes decreases. Since
> XFS is your own filesystem, you could calculate the results of this pretty
> quick.
>
> Using another filesystem could help too, but I don't have any comparison bases
> for this since the largest system we have here at this moment only has 1 TB
> of diskspace.
> I will monitor this thread though, we are in tender for a somewhat larger
> project in which problems like this could become our problem too.
>
> On Monday 15 November 2004 20:55, you wrote:
>> The subject line is a little deceiving. That number comes from using
>> XFS on a 2.4 kernel. With a 2.6 kernel, we see problems similar to the
>> ones we are experiencing on 2.4, only less severe.
>>
>> Digging into this some more, we determined the problem is the large number
>> of inodes and dentry items held. For a machine with 32GB of memory and
>> 8 cpus doing build type activity, we have found it stabilizes at between
>> 2 and 8 million entries.
>>
>> One significant problem we are running into is autofs trying to umount the
>> file systems. This results in the umount grabbing the BKL and inode_lock,
>> holding it while it scans through the inode_list and others looking for
>> inodes used by this super block and attempting to free them.
>>
>> We patched a SLES9 kernel with the patch found in the -mm tree which
>> attempts to address this problem by linking inodes off the sb structure.
>> This does make the umount somewhat quicker, but on a busy nfs mounted
>> filesystem, the BKL and inode_lock do still get in the way causing
>> frequent system pauses on the order of seconds. This is on a SLES9
>> kernel which we just put into a test production environment last Thursday.
>> By 8:00 AM Friday, the system was unusable.
>>
>> Additionally, we experience NULL pointer dereferences during
>> remove_inode_buffers. I have not looked for additional patches in the
>> -mm tree to address that problem.
>>
>> While discussing this in the hallway, we have come up with a few possible
>> alternatives.
>>
>> 1) Have the dentry and inode sizes limited on a per sb basis
>> with a mount option as an override for the default setting.
>>
>> 2) Have the vfs limit dentry and inode cache sizes based on
>> slab usage (ie, nfs, ext2, and xfs slab sizes are limited independently
>> of each other.
>>
>> 3) Have the vfs limit it based on total inode_list entries.
>>
>> We are not sure which if any is the right direction to go at this time.
>> We are only hoping to start a discussion. Any guidance would be
>> appreciated.
>>
>> Thank you,
>> Robin Holt
>>
>> PS: The patch referred to above is:
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=109474397830096&w=2

Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by John Ashcroft.
98.36% of all statistics are fiction.

2004-11-15 21:35:42

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Mon, Nov 15, 2004 at 03:57:44PM -0500, linux-os wrote:
>
> Another temporary fix is to do:
>
> while true ; do sleep 5 ; sync ; done

I don't think we are looking at a flushing-buffers-to-disk problem.
Even after doing a sync, I see 1.3M entries. Before the sync, I
was at 1.2M, so the count went up during the sync.

I am specifically noticing problems with the inode_list and not
buffers.

>
> ... or some 'C' code equivalent to force most of the stuff to
> disk before it takes so much time that it's obvious to the
> users.
>
> If you have soooo much data buffered, it is going to take a
> verrrry long time to write it to disk so. Just write it before
> you have so much buffered!

This is already being done. Nearly all of the inodes have
buffers that are expired and have been pushed to disk.

>
> NULL pointer problems shouldn't happen. However, you don't say
> if its a kernel crash problem or a user-mode problem. If it's
> a user-mode problem, the possibility exists that somebody isn't
> properly checking the return value of read/write, etc. If EIO
> (from attempting to modify an inode) was return in errno, you
> get -1 in the return value, it that's used as an index into the
> next bunch of data, you are dorked.

It is a kernel NULL pointer dereference in remove_inode_buffers().

Thanks,
Robin

2004-11-15 22:55:06

by Andrew Morton

Subject: Re: 21 million inodes is causing severe pauses.

Robin Holt <[email protected]> wrote:
>
> One significant problem we are running into is autofs trying to umount the
> file systems. This results in the umount grabbing the BKL and inode_lock,
> holding it while it scans through the inode_list and others looking for
> inodes used by this super block and attempting to free them.

You'll need invalidate_inodes-speedup.patch and
break-latency-in-invalidate_list.patch (or an equivalent).

That'll get you most of the way, but the BKL will still be a problem.

Removing lock_kernel() in the umount path is probably a major project so
for now, you can just drop and reacquire it by doing
release_kernel_lock()/reacquire_kernel_lock() around invalidate_inodes().

(You'll need to use that pair rather than unlock_kernel/lock_kernel because
it seems that invalidate_inodes can be called under various depths of
lock_kernel()).
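
In other words, something along these lines around the invalidate_inodes()
call (an illustration of the suggestion, not a tested patch; it assumes
current holds the BKL at that point):

	/* Illustration only: drop the BKL (at whatever depth current
	 * holds it) for the long scan, then take it back before
	 * continuing the umount. */
	release_kernel_lock(current);
	invalidate_inodes(sb);
	reacquire_kernel_lock(current);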


2004-11-16 16:31:29

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Mon, Nov 15, 2004 at 02:57:14PM -0800, Andrew Morton wrote:
> Robin Holt <[email protected]> wrote:
> >
> > One significant problem we are running into is autofs trying to umount the
> > file systems. This results in the umount grabbing the BKL and inode_lock,
> > holding it while it scans through the inode_list and others looking for
> > inodes used by this super block and attempting to free them.
>
> You'll need invalidate_inodes-speedup.patch and
> break-latency-in-invalidate_list.patch (or an equivalent).
>

I added break-latency-in-invalidate_list.patch to the SLES9 kernel.
I am running the test again, but I do not see how that change can do
anything to eliminate the race condition which appears to leave me with a
NULL pointer. I will dig into that more today if other obligations allow.

> That'll get you most of the way, but the BKL will still be a problem.
>
> Removing lock_kernel() in the umount path is probably a major project so
> for now, you can just drop and reacquire it by doing
> release_kernel_lock()/reacquire_kernel_lock() around invalidate_inodes().

I guess I am very concerned at this point. If I can do a
release/reacquire, why not just change generic_shutdown_super() so that the
lock_kernel() does not happen until the first pass has occurred? I.e.:

--- super.c.orig	2004-11-16 10:22:17 -06:00
+++ super.c	2004-11-16 10:22:41 -06:00
@@ -232,10 +232,10 @@
 		dput(root);
 		fsync_super(sb);
 		lock_super(sb);
-		lock_kernel();
 		sb->s_flags &= ~MS_ACTIVE;
 		/* bad name - it should be evict_inodes() */
 		invalidate_inodes(sb);
+		lock_kernel();
 
 		if (sop->write_super && sb->s_dirt)
 			sop->write_super(sb);

This at least makes the lock_kernel time much smaller than it is right
now. It also does not affect any callers that may really need the BKL.


I guess I am really asking for an indication of what the BKL is supposed
to be protecting here. I have not dug into the intent down the VFS code
paths at all.

2004-11-16 16:35:54

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Mon, Nov 15, 2004 at 02:57:14PM -0800, Andrew Morton wrote:
> Robin Holt <[email protected]> wrote:
> >
> > One significant problem we are running into is autofs trying to umount the
> > file systems. This results in the umount grabbing the BKL and inode_lock,
> > holding it while it scans through the inode_list and others looking for
> > inodes used by this super block and attempting to free them.
>
> You'll need invalidate_inodes-speedup.patch and
> break-latency-in-invalidate_list.patch (or an equivalent).

With these patches and a new test where I periodically put on mild
but diminishing memory pressure, I have been able to get the number
of inodes up to 31 million. I would really like to find a way to
reduce or limit the number of inodes, or am I seeing a problem where
none exists? After putting on constant mild memory pressure, I have
seen the number of inodes stabilize at 17-18 million.

Robin

2004-11-16 19:16:34

by Andrew Morton

Subject: Re: 21 million inodes is causing severe pauses.

Robin Holt <[email protected]> wrote:
>
> I guess I am very concerned at this point. If I can do a
> release/reacquire, why not just change generic_shutdown_super() so the
> lock_kernel() does not happen until the first pass has occurred. ie:
>
> --- super.c.orig 2004-11-16 10:22:17 -06:00
> +++ super.c 2004-11-16 10:22:41 -06:00
> @@ -232,10 +232,10 @@
> dput(root);
> fsync_super(sb);
> lock_super(sb);
> - lock_kernel();
> sb->s_flags &= ~MS_ACTIVE;
> /* bad name - it should be evict_inodes() */
> invalidate_inodes(sb);
> + lock_kernel();
>
> if (sop->write_super && sb->s_dirt)
> sop->write_super(sb);
>
> This at least makes the lock_kernel time much smaller than it is right
> now. It also does not affect any callers that may really need the BKL.

lock_kernel() is also taken way up in do_umount(), hence the need for
release_kernel_lock()/reacquire_kernel_lock().

>
> I guess I am really asking for an indication of what the BKL is supposed
> to be protecting. I have not dug for the intent down the VFS code paths
> at all.

It's not protecting anything around invalidate_inodes(). There may be
other things in the higher-level umount path which need it.

2004-11-16 19:18:26

by Andrew Morton

Subject: Re: 21 million inodes is causing severe pauses.

Robin Holt <[email protected]> wrote:
>
> On Mon, Nov 15, 2004 at 02:57:14PM -0800, Andrew Morton wrote:
> > Robin Holt <[email protected]> wrote:
> > >
> > > One significant problem we are running into is autofs trying to umount the
> > > file systems. This results in the umount grabbing the BKL and inode_lock,
> > > holding it while it scans through the inode_list and others looking for
> > > inodes used by this super block and attempting to free them.
> >
> > You'll need invalidate_inodes-speedup.patch and
> > break-latency-in-invalidate_list.patch (or an equivalent).
>
> With these patches and a new test where I periodically put on mild
> but diminishing memory pressure, I have been able to get the number
> of inodes up to 31 million. I would really like to find a way to
> reduce or limit the number of inodes, or am I seeing a problem where
> none exists? After putting on constant mild memory pressure, I have
> seen the number of inodes stabilize at 17-18 million.

There shouldn't be anything wrong with that per se. You have a big system,
and a workload which touches a lot of inodes. It would be best to fix up
the lock hold times rather than reducing the cached inode count because the
lock hold times are sucky.

That being said, you may get some joy by greatly increasing
/proc/sys/vm/vfs_cache_pressure. Try 10000.
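
(The usual way to apply that is simply
echo 10000 > /proc/sys/vm/vfs_cache_pressure. For completeness, a one-shot
C equivalent, assuming the tunable exists on the running kernel:)

#include <stdio.h>

/* One-shot equivalent of echoing the value into the procfs tunable;
 * needs root to write the file. */
int main(void)
{
	FILE *f = fopen("/proc/sys/vm/vfs_cache_pressure", "w");

	if (!f) {
		perror("/proc/sys/vm/vfs_cache_pressure");
		return 1;
	}
	fprintf(f, "10000\n");
	return fclose(f) ? 1 : 0;
}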

2004-11-16 20:03:25

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Tue, Nov 16, 2004 at 11:13:21AM -0800, Andrew Morton wrote:
> Robin Holt <[email protected]> wrote:
> >
> > I guess I am very concerned at this point. If I can do a
> > release/reacquire, why not just change generic_shutdown_super() so the
> > lock_kernel() does not happen until the first pass has occurred. ie:
> >
> > --- super.c.orig 2004-11-16 10:22:17 -06:00
> > +++ super.c 2004-11-16 10:22:41 -06:00
> > @@ -232,10 +232,10 @@
> > dput(root);
> > fsync_super(sb);
> > lock_super(sb);
> > - lock_kernel();
> > sb->s_flags &= ~MS_ACTIVE;
> > /* bad name - it should be evict_inodes() */
> > invalidate_inodes(sb);
> > + lock_kernel();
> >
> > if (sop->write_super && sb->s_dirt)
> > sop->write_super(sb);
> >
> > This at least makes the lock_kernel time much smaller than it is right
> > now. It also does not affect any callers that may really need the BKL.
>
> lock_kernel() is also taken way up in do_umount(), hence the need for
> release_kernel_lock()/reacquire_kernel_lock().

It looks like it is only held very briefly during the early parts of do_umount.

I have moved lock_kernel() as above in addition to the two patches you pointed
to earlier. This has left me with a system which has 21M inodes and undetectable
delays during heavy mount/umount activity. I am starting one last test which
attempts a umount of a filesystem which has many inodes associated with it.

At this point, I have checked the entire code path and see no reason for
the BKL to be held across the first call to invalidate_inodes().

2004-11-17 00:34:22

by Andrew Morton

Subject: Re: 21 million inodes is causing severe pauses.

Robin Holt <[email protected]> wrote:
>
> On Tue, Nov 16, 2004 at 11:13:21AM -0800, Andrew Morton wrote:
> > Robin Holt <[email protected]> wrote:
> > >
> > > I guess I am very concerned at this point. If I can do a
> > > release/reacquire, why not just change generic_shutdown_super() so the
> > > lock_kernel() does not happen until the first pass has occurred. ie:
> > >
> > > --- super.c.orig 2004-11-16 10:22:17 -06:00
> > > +++ super.c 2004-11-16 10:22:41 -06:00
> > > @@ -232,10 +232,10 @@
> > > dput(root);
> > > fsync_super(sb);
> > > lock_super(sb);
> > > - lock_kernel();
> > > sb->s_flags &= ~MS_ACTIVE;
> > > /* bad name - it should be evict_inodes() */
> > > invalidate_inodes(sb);
> > > + lock_kernel();
> > >
> > > if (sop->write_super && sb->s_dirt)
> > > sop->write_super(sb);
> > >
> > > This at least makes the lock_kernel time much smaller than it is right
> > > now. It also does not affect any callers that may really need the BKL.
> >
> > lock_kernel() is also taken way up in do_umount(), hence the need for
> > release_kernel_lock()/reacquire_kernel_lock().
>
> It looks like it is only held very briefly during the early parts of do_umount.
>

OK.

> I have moved lock_kernel() as above in addition to the two patches you pointed
> to earlier. This has left me with a system which has 21M inodes and undetectable
> delays during heavy mount/umount activity. I am starting one last test which
> attempts a umount of a filesystem which has many inodes associated with it.

That sounds good. We need to work out where that null-pointer deref is
coming from.

> At this point, I have checked the entire code path and see no reason the
> BKL is held for the first call to invalidate_inodes.

No, the above change looks fine. And I have no problem merging up
invalidate_inodes-speedup.patch, really - it's been in -mm for over a year.
I've just been waiting for a decent reason to merge it.

2004-11-17 00:55:06

by Robin Holt

Subject: Re: 21 million inodes is causing severe pauses.

On Tue, Nov 16, 2004 at 04:33:10PM -0800, Andrew Morton wrote:
> > I have moved lock_kernel() as above in addition to the two patches you pointed
> > to earlier. This has left me with a system which has 21M inodes and undetectable
> > delays during heavy mount/umount activity. I am starting one last test which
> > attempts a umount of a filesystem which has many inodes associated with it.
>
> That sounds good. We need to work out where that null-pointer deref is
> coming from.
>

I had originally used a patch from one of our developers' trees which was
different from the one in your most recent -mm tree. The patch I was
using had a missing list_del(... i_list); or something like that. I could
dig out the exact line if you want to sanity-check me, but I don't have it
handy right now. With the patch in your tree, I have not seen any NULL
pointer dereferences.

> > At this point, I have checked the entire code path and see no reason the
> > BKL is held for the first call to invalidate_inodes.
>
> No, the above change looks fine. And I have no problem merging up
> invalidate_inodes-speedup.patch, really - it's been in -mm for over a year.
> I've just been waiting for a decent reason to merge it.

I would strongly encourage merging the three patches we have talked about
here. I understand you would typically keep my BKL patch in your tree for
a while, and I think that would be just fine. The changes only affect
systems that have filesystems being unmounted with a large number of
inodes.

With the two patches already in your tree, the pauses are greatly reduced
for the autofs case that originally got me looking.

Thanks,
Robin Holt

2004-11-17 01:06:28

by Andrew Morton

Subject: Re: 21 million inodes is causing severe pauses.

Robin Holt <[email protected]> wrote:
>
> > > At this point, I have checked the entire code path and see no reason the
> > > BKL is held for the first call to invalidate_inodes.
> >
> > No, the above change looks fine. And I have no problem merging up
> > invalidate_inodes-speedup.patch, really - it's been in -mm for over a year.
> > I've just been waiting for a decent reason to merge it.
>
> I would strongly encourage merging the three patches we have talked about
> here. I understand you would typically keep my BKL patch in your tree for
> awhile and think that would be just fine. The changes only affect systems
> that have filesystems being unmounted with a large number of inodes.
>
> With the two patches already in your tree, the pauses are greatly reduced
> for the autofs case that originally got me looking.

OK. It's a bit late for 2.6.10 but I was planning on slurping the whole
lot into 2.6.11 anyway.