2011-02-16 00:56:49

by Eric B Munson

[permalink] [raw]
Subject: NFS Regression in commit 0b26a0bf6ff398

While testing some 2.6.38 work my rsync backup script started consuming
large amounts of memory (all available before dying with no more available
memory). I have bisected the problem back to 0b26a0bf6ff398. I am
unfamiliar with the NFS code so I don't know where to start looking for a
possible fix. My backups files from my home directory to an NFS mounted
directory. The NFS server is a Synology DS-411+ if it matters. Let me know
if there is any other information I can provide.

Eric


Attachments:
(No filename) (494.00 B)
signature.asc (490.00 B)
Digital signature
Download all attachments

2011-02-16 01:02:41

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS Regression in commit 0b26a0bf6ff398

On Tue, 2011-02-15 at 19:56 -0500, Eric B Munson wrote:
> While testing some 2.6.38 work my rsync backup script started consuming
> large amounts of memory (all available before dying with no more available
> memory). I have bisected the problem back to 0b26a0bf6ff398. I am
> unfamiliar with the NFS code so I don't know where to start looking for a
> possible fix. My backups files from my home directory to an NFS mounted
> directory. The NFS server is a Synology DS-411+ if it matters. Let me know
> if there is any other information I can provide.

Exactly which 2.6.38 kernel are you running, and which NFS version?

I'm having trouble seeing how the patch in question can be responsible
for what you are seeing, so please could you provide more details of
your test setup.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-02-16 13:37:46

by Eric B Munson

[permalink] [raw]
Subject: Re: NFS Regression in commit 0b26a0bf6ff398

On Tue, 15 Feb 2011, Trond Myklebust wrote:

> On Tue, 2011-02-15 at 19:56 -0500, Eric B Munson wrote:
> > While testing some 2.6.38 work my rsync backup script started consuming
> > large amounts of memory (all available before dying with no more available
> > memory). I have bisected the problem back to 0b26a0bf6ff398. I am
> > unfamiliar with the NFS code so I don't know where to start looking for a
> > possible fix. My backups files from my home directory to an NFS mounted
> > directory. The NFS server is a Synology DS-411+ if it matters. Let me know
> > if there is any other information I can provide.
>
> Exactly which 2.6.38 kernel are you running, and which NFS version?

The problem shows in the head of Linus' tree but I was able to reproduce on
vanilla 2.6.37. I am using NFS version 3.

>
> I'm having trouble seeing how the patch in question can be responsible
> for what you are seeing, so please could you provide more details of
> your test setup.

I have a simple backup script that rsyncs the important files from my home
directory to an NFS mounted directory on the same machine. It is a small
home network, just a single switch between machine and file server. The error
can be reproduced all the way back to that patch, I tested 2.6.37 looking for
a good point to start bisection and had to go back to 2.6.36 to find it. When
it happens, my backup script stalls and the rsync process starts eating all the
free memory on the system. There are three active rsync processes, two are
stalled in calls to select and the third is repeating a call to lstat on the
same file until rsync fails a memory allocation.

>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>


Attachments:
(No filename) (1.73 kB)
signature.asc (490.00 B)
Digital signature
Download all attachments

2011-02-24 19:58:18

by Eric B Munson

[permalink] [raw]
Subject: Re: NFS Regression in commit 0b26a0bf6ff398

On Tue, 15 Feb 2011, Trond Myklebust wrote:

> On Tue, 2011-02-15 at 19:56 -0500, Eric B Munson wrote:
> > While testing some 2.6.38 work my rsync backup script started consuming
> > large amounts of memory (all available before dying with no more available
> > memory). I have bisected the problem back to 0b26a0bf6ff398. I am
> > unfamiliar with the NFS code so I don't know where to start looking for a
> > possible fix. My backups files from my home directory to an NFS mounted
> > directory. The NFS server is a Synology DS-411+ if it matters. Let me know
> > if there is any other information I can provide.
>
> Exactly which 2.6.38 kernel are you running, and which NFS version?
>
> I'm having trouble seeing how the patch in question can be responsible
> for what you are seeing, so please could you provide more details of
> your test setup.
>

Trond,

I just updated to 2.6.38-rc6 and I still see the regression. I will add some
more information here that might be useful.

The entries from ps -ef:

user 3116 3096 0 14:45 pts/4 00:00:00 /bin/bash /home/emunson/bin/bu
user 3117 3116 3 14:45 pts/4 00:00:05 rsync -avz --delete --exclude=*.
user 3118 3117 26 14:45 pts/4 00:00:41 rsync -avz --delete --exclude=*.
user 3119 3118 0 14:45 pts/4 00:00:01 rsync -avz --delete --exclude=*.

strace from 3117:
user@machine:~$ sudo strace -p 3117
Process 3117 attached - interrupt to quit
select(6, [5], [], NULL, {33, 382426}) = 0 (Timeout)
select(6, [5], [], NULL, {60, 0}) = 0 (Timeout)
select(6, [5], [], NULL, {60, 0}) = 0 (Timeout)
select(6, [5], [], NULL, {60, 0}) = 0 (Timeout)
select(6, [5], [], NULL, {60, 0}

strace from 3118:
...
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
lstat("linux-2.6/net/netfilter/xt_dscp.c", {st_mode=S_IFREG|0644, st_size=2890, ...}) = 0
...

These stream by very quickly.

and finally from 3119:
user@machine:~$ sudo strace -p 3119
Process 3119 attached - interrupt to quit
select(5, NULL, [4], [4], {15, 621031}) = 0 (Timeout)
select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
select(5, NULL, [4], [4], {60, 0}

Now from my /etc/fstab:

/dev/mapper/isw_gbdfddifh_Volume02 / ext4 errors=remount-ro 0 1
/dev/mapper/isw_gbdfddifh_Volume03 none swap sw 0 0

#NFS
192.168.1.50:/volume1/backup /mnt/backup nfs rsize=1048576,wsize=1048576,user 0 0
192.168.1.50:/volume1/data /mnt/data nfs rsize=1048576,wsize=1048576,user 0 0
192.168.1.50:/volume1/music /mp3 nfs rsize=1048576,wsize=1048576,user 0 0
192.168.1.50:/volume1/video /video nfs rsize=1048576,wsize=1048576,user 0 0

The backup script is reading from my home dir (locally mounted on partition 2 of a
fake raid stripe). And writing to /mnt/backup/bert-ubuntu.

And here is the error I get when rsync finally dies:

ERROR: out of memory in flist_expand [generator]
rsync error: error allocating core memory buffers (code 22) at util.c(117) [generator=3.0.7]
rsync: connection unexpectedly closed (9324 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7]

At the point when this happens, rsync is consuming almost all of the free memory on the system.

I am nost sure if there is anything else that might help, please let me know if you need more
information.


Attachments:
(No filename) (4.01 kB)
signature.asc (490.00 B)
Digital signature
Download all attachments

2011-03-24 18:29:49

by Eric B Munson

[permalink] [raw]
Subject: Re: NFS Regression in commit 0b26a0bf6ff398

On Tue, 15 Feb 2011, Trond Myklebust wrote:

> On Tue, 2011-02-15 at 19:56 -0500, Eric B Munson wrote:
> > While testing some 2.6.38 work my rsync backup script started consuming
> > large amounts of memory (all available before dying with no more available
> > memory). I have bisected the problem back to 0b26a0bf6ff398. I am
> > unfamiliar with the NFS code so I don't know where to start looking for a
> > possible fix. My backups files from my home directory to an NFS mounted
> > directory. The NFS server is a Synology DS-411+ if it matters. Let me know
> > if there is any other information I can provide.
>
> Exactly which 2.6.38 kernel are you running, and which NFS version?
>
> I'm having trouble seeing how the patch in question can be responsible
> for what you are seeing, so please could you provide more details of
> your test setup.
>

I am running 2.6.38 and still having this problem. The strace output from
earlier still applies and here is output from SysRq+L for the strace processes:

[20965.685696] SysRq : Show backtrace of all active CPUs
[20965.685699] sending NMI to all CPUs:
[20965.685702] NMI backtrace for cpu 0
[20965.685704] CPU 0
[20965.685705] Modules linked in: binfmt_misc nfs lockd fscache nfs_acl auth_rpcgss sunrpc kvm_intel kvm parport_pc ppdev snd_hda_codec_hdmi snd_hda_codec_realtek radeon deflate zlib_deflate ctr twofish_generic twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic aesni_intel cryptd aes_x86_64 aes_generic xcbc rmd160 sha512_generic sha256_generic snd_hda_intel snd_hda_codec ttm sha1_generic crypto_null af_key snd_usb_audio snd_usbmidi_lib snd_hwdep snd_pcm snd_seq_midi snd_rawmidi drm_kms_helper snd_seq_midi_event snd_seq drm snd_timer snd_seq_device uvcvideo snd hwmon_vid i7core_edac psmouse xhci_hcd max6650 videodev edac_core joydev lp asus_atk0110 parport snd_page_alloc hid_microsoft i2c_algo_bit soundcore v4l2_compat_ioctl32 serio_raw usbhid hid firewire_ohci firewire_core crc_itu_t sky2 ahci libahci
[20965.685747]
[20965.685748] Pid: 9210, comm: rsync Not tainted 2.6.38+ #38 System manufacturer System Product Name/P6X58D PREMIUM
[20965.685752] RIP: 0010:[<ffffffffa0551870>] [<ffffffffa0551870>] put_rpccred+0x40/0x150 [sunrpc]
[20965.685763] RSP: 0018:ffff88032178fc88 EFLAGS: 00000202
[20965.685765] RAX: 0000000000000006 RBX: ffff88031ae27500 RCX: 8c6318c6318c6320
[20965.685766] RDX: 000000000000fcfb RSI: ffff8800bd600000 RDI: ffff88031ae27500
[20965.685767] RBP: ffff88032178fc98 R08: 0000000000000000 R09: 0000000000000001
[20965.685769] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[20965.685770] R13: ffff880296962778 R14: ffff88031ae27500 R15: 0000000000000000
[20965.685772] FS: 00007fe320eb8700(0000) GS:ffff8800bd600000(0000) knlGS:0000000000000000
[20965.685773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20965.685774] CR2: 000000000a9c1fe0 CR3: 00000003243f7000 CR4: 00000000000006f0
[20965.685776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20965.685777] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[20965.685779] Process rsync (pid: 9210, threadinfo ffff88032178e000, task ffff8803228bc620)
[20965.685780] Stack:
[20965.685781] 0000000000000000 0000000000000001 ffff88032178fcd8 ffffffffa079a0da
[20965.685785] ffff88032178fd28 ffff880323d2a009 ffff880323d2a009 ffff88032178fdc8
[20965.685788] ffff880296962778 ffff8803228bc620 ffff88032178fd78 ffffffff81171112
[20965.685791] Call Trace:
[20965.685799] [<ffffffffa079a0da>] nfs_permission+0xea/0x1d0 [nfs]
[20965.685803] [<ffffffff81171112>] link_path_walk+0x222/0xaa0
[20965.685807] [<ffffffff8116f9a8>] ? path_init_rcu+0x98/0x270
[20965.685809] [<ffffffff81171c6b>] do_path_lookup+0x5b/0x140
[20965.685811] [<ffffffff811729f7>] user_path_at+0x57/0xa0
[20965.685818] [<ffffffff81125bcc>] ? might_fault+0x5c/0xb0
[20965.685820] [<ffffffff81167d68>] ? cp_new_stat+0xf8/0x110
[20965.685822] [<ffffffff81167f96>] vfs_fstatat+0x46/0x80
[20965.685824] [<ffffffff8116803e>] vfs_lstat+0x1e/0x20
[20965.685826] [<ffffffff81168064>] sys_newlstat+0x24/0x50
[20965.685830] [<ffffffff815ac999>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20965.685834] [<ffffffff81002f82>] system_call_fastpath+0x16/0x1b
[20965.685836] Code: 00 48 8b 47 50 48 89 fb a8 04 75 1f f0 ff 4f 58 0f 94 c0 84 c0 0f 85 fd 00 00 00 48 8b 1c 24 4c 8b 64 24 08 c9 c3 eb 03 90 90 90 <48> 83 c7 58 48 c7 c6 a0 cd 56 a0 e8 f0 60 d8 e0 85 c0 74 dc 48
[20965.685874] Call Trace:
[20965.685883] [<ffffffffa079a0da>] nfs_permission+0xea/0x1d0 [nfs]
[20965.685888] [<ffffffff81171112>] link_path_walk+0x222/0xaa0
[20965.685894] [<ffffffff8116f9a8>] ? path_init_rcu+0x98/0x270
[20965.685898] [<ffffffff81171c6b>] do_path_lookup+0x5b/0x140
[20965.685902] [<ffffffff811729f7>] user_path_at+0x57/0xa0
[20965.685907] [<ffffffff81125bcc>] ? might_fault+0x5c/0xb0
[20965.685912] [<ffffffff81167d68>] ? cp_new_stat+0xf8/0x110
[20965.685917] [<ffffffff81167f96>] vfs_fstatat+0x46/0x80
[20965.685921] [<ffffffff8116803e>] vfs_lstat+0x1e/0x20
[20965.685928] [<ffffffff81168064>] sys_newlstat+0x24/0x50
[20965.685931] [<ffffffff815ac999>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20965.685933] [<ffffffff81002f82>] system_call_fastpath+0x16/0x1b
[20965.685935] Pid: 9210, comm: rsync Not tainted 2.6.38+ #38
[20965.685936] Call Trace:
[20965.685937] <NMI> [<ffffffff8100ba27>] ? show_regs+0x27/0x30
[20965.685942] [<ffffffff815af4d6>] ? arch_trigger_all_cpu_backtrace_handler+0x76/0x90
[20965.685945] [<ffffffff815b1786>] ? notifier_call_chain+0x56/0x80
[20965.685947] [<ffffffff815b181c>] ? __atomic_notifier_call_chain+0x6c/0xa0
[20965.685949] [<ffffffff815b17b0>] ? __atomic_notifier_call_chain+0x0/0xa0
[20965.685951] [<ffffffff815b1866>] ? atomic_notifier_call_chain+0x16/0x20
[20965.685953] [<ffffffff815b189e>] ? notify_die+0x2e/0x30
[20965.685956] [<ffffffff815ae58a>] ? do_nmi+0xda/0x290
[20965.685958] [<ffffffff815ae0f0>] ? nmi+0x20/0x39
[20965.685963] [<ffffffffa0551870>] ? put_rpccred+0x40/0x150 [sunrpc]
[20965.685965] <<EOE>> [<ffffffffa079a0da>] ? nfs_permission+0xea/0x1d0 [nfs]
[20965.685971] [<ffffffff81171112>] ? link_path_walk+0x222/0xaa0
[20965.685973] [<ffffffff8116f9a8>] ? path_init_rcu+0x98/0x270
[20965.685975] [<ffffffff81171c6b>] ? do_path_lookup+0x5b/0x140
[20965.685977] [<ffffffff811729f7>] ? user_path_at+0x57/0xa0
[20965.685979] [<ffffffff81125bcc>] ? might_fault+0x5c/0xb0
[20965.685981] [<ffffffff81167d68>] ? cp_new_stat+0xf8/0x110
[20965.685983] [<ffffffff81167f96>] ? vfs_fstatat+0x46/0x80
[20965.685985] [<ffffffff8116803e>] ? vfs_lstat+0x1e/0x20
[20965.685987] [<ffffffff81168064>] ? sys_newlstat+0x24/0x50
[20965.685990] [<ffffffff815ac999>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20965.685992] [<ffffffff81002f82>] ? system_call_fastpath+0x16/0x1b


This is the pid that gets stuck on lstat. Is there anything else that mught be
helpful?


Attachments:
(No filename) (6.77 kB)
signature.asc (490.00 B)
Digital signature
Download all attachments

2011-04-06 21:31:34

by Eric B Munson

[permalink] [raw]
Subject: Re: NFS Regression in commit 0b26a0bf6ff398

On Tue, 15 Feb 2011, Trond Myklebust wrote:

> On Tue, 2011-02-15 at 19:56 -0500, Eric B Munson wrote:
> > While testing some 2.6.38 work my rsync backup script started consuming
> > large amounts of memory (all available before dying with no more available
> > memory). I have bisected the problem back to 0b26a0bf6ff398. I am
> > unfamiliar with the NFS code so I don't know where to start looking for a
> > possible fix. My backups files from my home directory to an NFS mounted
> > directory. The NFS server is a Synology DS-411+ if it matters. Let me know
> > if there is any other information I can provide.
>
> Exactly which 2.6.38 kernel are you running, and which NFS version?
>
> I'm having trouble seeing how the patch in question can be responsible
> for what you are seeing, so please could you provide more details of
> your test setup.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>

Whatever was broken was fixed with 2.6.39-rc1. Thanks for your help.

Eric


Attachments:
(No filename) (1.03 kB)
signature.asc (490.00 B)
Digital signature
Download all attachments