2009-05-03 10:56:02

by Fengguang Wu

[permalink] [raw]
Subject: very slow NFS boot on linux-next and -mm

Hi,

My NFSROOT box sometimes will get stuck for a dozen seconds during boot.

Here is the stucked process:

[ 48.983671] gzip D 0000000000000000 2880 2976 2966
[ 48.983671] ffff88007b087ae8 0000000000000046 0000000000000000 ffffffff81079c58
[ 48.983671] 00000000001d1a00 000000000000cda8 ffff88007b87dc80 ffffffff817934a0
[ 48.983671] ffff88007b87e010 ffff880005bd1a00 000000007b087a98 ffff880005bd1a18
[ 48.983671] Call Trace:
[ 48.983671] [<ffffffff81079c58>] ? mark_held_locks+0x68/0x90
[ 48.983671] [<ffffffff810709cd>] ? getnstimeofday+0x9d/0x140
[ 48.983671] [<ffffffff8152cb85>] schedule+0x15/0x50
[ 48.983671] [<ffffffff8152cc28>] io_schedule+0x68/0xb0
[ 48.983671] [<ffffffff810c1935>] sync_page+0x85/0xc0
[ 48.983671] [<ffffffff8152d482>] __wait_on_bit+0x62/0x90
[ 48.983671] [<ffffffff810c18b0>] ? sync_page+0x0/0xc0
[ 48.983671] [<ffffffff810c1c64>] wait_on_page_bit+0x74/0x80
[ 48.983671] [<ffffffff81066490>] ? wake_bit_function+0x0/0x50
[ 48.983671] [<ffffffff810cd5d7>] ? pagevec_lookup_tag+0x27/0x40
[ 48.983671] [<ffffffff810c2906>] wait_on_page_writeback_range+0xc6/0x150
[ 48.983671] [<ffffffff810cb600>] ? do_writepages+0x30/0x50
[ 48.983671] [<ffffffff810c2a29>] ? __filemap_fdatawrite_range+0x59/0x70
[ 48.983671] [<ffffffff810c29be>] filemap_fdatawait+0x2e/0x40
[ 48.983671] [<ffffffff810c3e8d>] filemap_write_and_wait+0x4d/0x60
[ 48.983671] [<ffffffff8118de75>] nfs_setattr+0x135/0x150
[ 48.983671] [<ffffffff81070031>] ? current_kernel_time+0x41/0x70
[ 48.983671] [<ffffffff81079c58>] ? mark_held_locks+0x68/0x90
[ 48.983671] [<ffffffff810512d7>] ? current_fs_time+0x27/0x30
[ 48.983671] [<ffffffff81076dc0>] ? debug_mutex_free_waiter+0x80/0xc0
[ 48.983671] [<ffffffff8111bce1>] notify_change+0x121/0x330
[ 48.983671] [<ffffffff81129993>] utimes_common+0xd3/0x1c0
[ 48.983671] [<ffffffff81129b7c>] do_utimes+0xfc/0x110
[ 48.983671] [<ffffffff81079fad>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 48.983671] [<ffffffff81129cbc>] sys_utimensat+0x3c/0xb0
[ 48.983671] [<ffffffff8100c2b2>] system_call_fastpath+0x16/0x1b

The NFSROOT client that got stuck is running latest linux-next, the
server side is running 2.6.30-rc3.

Switching the client kernel to latest -mm makes it very reproducible.
Here is another stack dump:

[ 180.399845] udevd D ffff8800280291e0 3056 1388 1
[ 180.399845] ffff88007bdeb7b8 0000000000000046 ffff88007bdec000 ffff88007bdec7d0
[ 180.399845] ffff88007bdeb718 00000000001d1000 000000000000cd68 ffff88007bdec000
[ 180.399845] ffff88007d3dc000 ffff88007bdec378 0000000100000286 00000000ffff8a5d
[ 180.399845] Call Trace:
[ 180.399845] [<ffffffff8107673d>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 180.399845] [<ffffffff8107679d>] ? trace_hardirqs_on+0xd/0x10
[ 180.399845] [<ffffffff814e3071>] rpc_wait_bit_killable+0x21/0x40
[ 180.399845] [<ffffffff81549552>] __wait_on_bit+0x62/0x90
[ 180.399845] [<ffffffff814e3050>] ? rpc_wait_bit_killable+0x0/0x40
[ 180.399845] [<ffffffff814e3050>] ? rpc_wait_bit_killable+0x0/0x40
[ 180.399845] [<ffffffff815495f9>] out_of_line_wait_on_bit+0x79/0x90
[ 180.399845] [<ffffffff81063270>] ? wake_bit_function+0x0/0x50
[ 180.399845] [<ffffffff814e3965>] __rpc_execute+0x105/0x2b0
[ 180.399845] [<ffffffff814e3b3d>] rpc_execute+0x2d/0x40
[ 180.399845] [<ffffffff814dc5ea>] rpc_run_task+0x4a/0xc0
[ 180.399845] [<ffffffff814dc772>] rpc_call_sync+0x42/0x70
[ 180.399845] [<ffffffff811c6bdd>] nfs3_rpc_wrapper+0x2d/0x70
[ 180.399845] [<ffffffff811c73af>] nfs3_proc_getattr+0x4f/0x90
[ 180.399845] [<ffffffff811b8e26>] __nfs_revalidate_inode+0xc6/0x220
[ 180.399845] [<ffffffff81012ad0>] ? native_sched_clock+0x20/0x70
[ 180.399845] [<ffffffff81012b29>] ? sched_clock+0x9/0x10
[ 180.399845] [<ffffffff81074405>] ? lock_release_holdtime+0x35/0x1c0
[ 180.399845] [<ffffffff811b33fe>] nfs_lookup_revalidate+0x41e/0x580
[ 180.399845] [<ffffffff81012ad0>] ? native_sched_clock+0x20/0x70
[ 180.399845] [<ffffffff81012ad0>] ? native_sched_clock+0x20/0x70
[ 180.399845] [<ffffffff81012b29>] ? sched_clock+0x9/0x10
[ 180.399845] [<ffffffff81074405>] ? lock_release_holdtime+0x35/0x1c0
[ 180.399845] [<ffffffff81116d42>] ? __d_lookup+0x142/0x210
[ 180.399845] [<ffffffff81116c00>] ? __d_lookup+0x0/0x210
[ 180.399845] [<ffffffff8110bb77>] do_lookup+0x67/0x260
[ 180.399845] [<ffffffff811b60c8>] ? nfs_permission+0x178/0x1e0
[ 180.399845] [<ffffffff8110e3c4>] __link_path_walk+0x9a4/0x1250
[ 180.399845] [<ffffffff81074405>] ? lock_release_holdtime+0x35/0x1c0
[ 180.399845] [<ffffffff8110ee8c>] path_walk+0x5c/0xb0
[ 180.399845] [<ffffffff8110ef43>] do_path_lookup+0x63/0xb0
[ 180.399845] [<ffffffff8110ffeb>] do_filp_open+0xfb/0xb40
[ 180.399845] [<ffffffff81012ad0>] ? native_sched_clock+0x20/0x70
[ 180.399845] [<ffffffff81012b29>] ? sched_clock+0x9/0x10
[ 180.399845] [<ffffffff812777ae>] ? _raw_spin_unlock+0x7e/0xa0
[ 180.399845] [<ffffffff8111b571>] ? alloc_fd+0x131/0x160
[ 180.399845] [<ffffffff810ffa57>] do_sys_open+0x87/0x140
[ 180.399845] [<ffffffff810ffb50>] sys_open+0x20/0x30
[ 180.399845] [<ffffffff8100c072>] system_call_fastpath+0x16/0x1b

I noticed that when udevd is waiting for NFS IO, there is a steady
250K/s network traffic.

Any suggestions on further debugging?

Thanks,
Fengguang


2009-05-03 12:40:13

by Fengguang Wu

[permalink] [raw]
Subject: Re: very slow NFS boot on linux-next and -mm

Ah it's not really a NFS problem..

On Sun, May 03, 2009 at 06:54:56PM +0800, Wu Fengguang wrote:
>
> My NFSROOT box sometimes will get stuck for a dozen seconds during boot.
>
[snip]
> The NFSROOT client that got stuck is running latest linux-next, the
> server side is running 2.6.30-rc3.
>
> Switching the client kernel to latest -mm makes it very reproducible.
> Here is another stack dump:
>
> [ 180.399845] udevd D ffff8800280291e0 3056 1388 1

The udevd is doing this busy loop like mad:

open("/etc/group", O_RDONLY|O_CLOEXEC) = 7
lseek(7, 0, SEEK_CUR) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=797, ...}) = 0
mmap(NULL, 797, PROT_READ, MAP_SHARED, 7, 0) = 0x7f6791601000
lseek(7, 797, SEEK_SET) = 797
munmap(0x7f6791601000, 797) = 0
close(7) = 0

udev version is 0.141-1.

Thanks,
Fengguang


2009-05-03 12:56:06

by Kay Sievers

[permalink] [raw]
Subject: Re: very slow NFS boot on linux-next and -mm

On Sun, May 3, 2009 at 14:40, Wu Fengguang <[email protected]> wro=
te:

> The udevd is doing this busy loop like mad:
>
> =C2=A0 =C2=A0 =C2=A0 =C2=A0open("/etc/group", O_RDONLY|O_CLOEXEC) =C2=
=A0=3D 7
> =C2=A0 =C2=A0 =C2=A0 =C2=A0lseek(7, 0, SEEK_CUR) =C2=A0 =C2=A0 =C2=A0=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =3D 0
> =C2=A0 =C2=A0 =C2=A0 =C2=A0fstat(7, {st_mode=3DS_IFREG|0644, st_size=3D=
797, ...}) =3D 0
> =C2=A0 =C2=A0 =C2=A0 =C2=A0mmap(NULL, 797, PROT_READ, MAP_SHARED, 7, =
0) =3D 0x7f6791601000
> =C2=A0 =C2=A0 =C2=A0 =C2=A0lseek(7, 797, SEEK_SET) =C2=A0 =C2=A0 =C2=A0=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =3D 797
> =C2=A0 =C2=A0 =C2=A0 =C2=A0munmap(0x7f6791601000, 797) =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =3D 0
> =C2=A0 =C2=A0 =C2=A0 =C2=A0close(7) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0=3D 0
>
> udev version is 0.141-1.

Are you sure that's "busy loop" in that sense, and not just "load"?

If your config carries OWNER=3D or GROUP=3D rules in the udev rules, bu=
t
your system does not have these users or groups configured, glibc will
try to resolve these names to the uid/gid with every event that needs
these values.

If the in rules configured user/group names can be resolved at startup
of udevd, they will be cached and never looked up again.

Are you sure, that your system is configured correctly regarding the
used rules and referenced system users/groups names?

Thanks,
Kay

2009-05-06 13:42:58

by Kay Sievers

[permalink] [raw]
Subject: Re: very slow NFS boot on linux-next and -mm

On Mon, May 4, 2009 at 15:30, Wu Fengguang <[email protected]> wro=
te:

> I tried remove every udev rules in /etc/udev/ and /lib/udev, the /etc=
/group
> accesses disappeared in strace, but udevd is still busy.

> ppoll([{fd=3D4, events=3DPOLLIN}, {fd=3D5, events=3DPOLLIN}, {fd=3D3,=
events=3DPOLLIN}], 3, NULL, [], 8) =3D 1 ([{fd=3D3, revents=3DPOLLIN}]=
)
> ioctl(3, FIONREAD, [39]) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0=3D 0
> read(3, 0x62ad60, 39) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
=C2=A0 =C2=A0 =3D -1 EINVAL (Invalid argument)

Seems, you have issues with inotify on your nfs mount?

Inotify wakes up udevd to tell something in the rules directory has
changed, but inotify seems not to return anything useful, but keeps
waking us up. That causes an endless loop of parsing rules files.

Thanks,
Kay

2009-05-07 11:59:47

by Fengguang Wu

[permalink] [raw]
Subject: [PATCH] inotify: report rounded-up event size to user space

On Wed, May 06, 2009 at 09:42:58PM +0800, Kay Sievers wrote:
> On Mon, May 4, 2009 at 15:30, Wu Fengguang <[email protected]> w=
rote:
>=20
> > I tried remove every udev rules in /etc/udev/ and /lib/udev, the /e=
tc/group
> > accesses disappeared in strace, but udevd is still busy.
>=20
> > ppoll([{fd=3D4, events=3DPOLLIN}, {fd=3D5, events=3DPOLLIN}, {fd=3D=
3, events=3DPOLLIN}], 3, NULL, [], 8) =3D 1 ([{fd=3D3, revents=3DPOLLIN=
}])
> > ioctl(3, FIONREAD, [39]) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0=3D 0
> > read(3, 0x62ad60, 39) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =3D -1 EINVAL (Invalid argument)
>=20
> Seems, you have issues with inotify on your nfs mount?
>=20
> Inotify wakes up udevd to tell something in the rules directory has
> changed, but inotify seems not to return anything useful, but keeps
> waking us up. That causes an endless loop of parsing rules files.

Thanks for the tip. The failed inotify read() is caused by the size *ro=
undup*
behavior introduced by the -mm commit 3b46cf7d5f3ca(Reimplement inotify=
_user
using fsnotify). Which says:

+ /*
+ * We need to pad the filename so as to properly align =
an
+ * array of inotify_event structures. Because the stru=
cture is
+ * small and the common case is a small filename, we ju=
st round
+ * up to the next multiple of the structure's sizeof. =
This is
+ * simple and safe for all architectures.
+ */

The udev madness originates from these kernel testing failures:

[ 756.569243] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.600103] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.630265] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.670862] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.701845] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.732899] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.763126] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.794829] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.824985] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.856760] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 761.608521] __ratelimit: 210 callbacks suppressed

Which are printed by the following patch:

--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
static struct fsnotify_event *get_one_event(struct fsnotify_group *gro=
up,
size_t count)
{
size_t event_size =3D sizeof(struct inotify_event);
struct fsnotify_event *event;
=20
if (fsnotify_notify_queue_is_empty(group))
return NULL;
=20
event =3D fsnotify_peek_notify_event(group);
=20
event_size +=3D roundup(event->name_len, event_size);
=20
- if (event_size > count)
+ if (event_size > count) {
+ if (printk_ratelimit())
+ printk("get_one_event: event_size=3D%d > count=3D%d, name_len=3D%d,=
name=3D%s\n",
+ (int)event_size, (int)count, (int)event->name_len, event->file_na=
me);
return ERR_PTR(-EINVAL);
+ }


It can be fixed by reporting the rounded up value to user space.

Thanks,
=46engguang
---
inotify: report rounded-up event size to user space

=46ix a udev madness problem, which falls into an endless loop:

(1) ppoll([{fd=3D4, events=3DPOLLIN}, {fd=3D5, events=3DPOLLIN}, {fd=3D=
3, events=3DPOLLIN}], 3, NULL, [], 8) =3D 1 ([{fd=3D3, revents=3DPOLLIN=
}])
(2) ioctl(3, FIONREAD, [39]) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0=3D 0
(3) read(3, 0x62ad60, 39) =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =3D -1 EINVAL (Invalid argument)

In (2) we reported a small len, while in (3) we insist on a rounded up =
len,
leading to a failed inotify read(), which will be retried endlessly by =
udev.

[ 756.569243] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.600103] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.630265] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.670862] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.701845] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.732899] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.763126] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.794829] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.824985] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 756.856760] get_one_event: event_size=3D48 > count=3D38, name_len=3D=
22, name=3D61-dev-root-link.rules
[ 761.608521] __ratelimit: 210 callbacks suppressed

Signed-off-by: Wu Fengguang <[email protected]>
---
fs/notify/inotify/inotify_user.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- linux.orig/fs/notify/inotify/inotify_user.c
+++ linux/fs/notify/inotify/inotify_user.c
@@ -318,7 +318,9 @@ static long inotify_ioctl(struct file *f
mutex_lock(&group->notification_mutex);
list_for_each_entry(holder, &group->notification_list, event_list) {
event =3D holder->event;
- send_len +=3D sizeof(struct inotify_event) + event->name_len;
+ send_len +=3D sizeof(struct inotify_event);
+ send_len +=3D roundup(event->name_len,
+ sizeof(struct inotify_event));
}
mutex_unlock(&group->notification_mutex);
ret =3D put_user(send_len, (int __user *) p);

2009-05-07 12:16:48

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH] inotify: report rounded-up event size to user space

On Thu, 2009-05-07 at 19:59 +0800, Wu Fengguang wrote:

> ---
> inotify: report rounded-up event size to user space
>
> Fix a udev madness problem, which falls into an endless loop:
>
> (1) ppoll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=3, events=POLLIN}], 3, NULL, [], 8) = 1 ([{fd=3, revents=POLLIN}])
> (2) ioctl(3, FIONREAD, [39]) = 0
> (3) read(3, 0x62ad60, 39) = -1 EINVAL (Invalid argument)
>
> In (2) we reported a small len, while in (3) we insist on a rounded up len,
> leading to a failed inotify read(), which will be retried endlessly by udev.
>
> [ 756.569243] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.600103] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.630265] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.670862] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.701845] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.732899] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.763126] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.794829] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.824985] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 756.856760] get_one_event: event_size=48 > count=38, name_len=22, name=61-dev-root-link.rules
> [ 761.608521] __ratelimit: 210 callbacks suppressed
>
> Signed-off-by: Wu Fengguang <[email protected]>

Gaaah should have remembered to do that with the ioctl.

Acked-by: Eric Paris <[email protected]>

Andrew can you add this to your tree, I will roll it into my next patch
set.

-Eric

> ---
> fs/notify/inotify/inotify_user.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> --- linux.orig/fs/notify/inotify/inotify_user.c
> +++ linux/fs/notify/inotify/inotify_user.c
> @@ -318,7 +318,9 @@ static long inotify_ioctl(struct file *f
> mutex_lock(&group->notification_mutex);
> list_for_each_entry(holder, &group->notification_list, event_list) {
> event = holder->event;
> - send_len += sizeof(struct inotify_event) + event->name_len;
> + send_len += sizeof(struct inotify_event);
> + send_len += roundup(event->name_len,
> + sizeof(struct inotify_event));
> }
> mutex_unlock(&group->notification_mutex);
> ret = put_user(send_len, (int __user *) p);