2008-02-28 11:31:56

by Allard Hoeve

Subject: Kernel 2.6.23.17 crash (Was: Kernel (2.6.24) crash on nfsd (BUG: soft lockup))


Dear mailing list,

After trying 2.6.23.17, the same crash happened. The stack trace is a bit
different, but the two are comparable.

Is this an NFS problem in the first place? Where could we go for help with
this problem?

Regards,

Allard Hoeve



Pid: 2643, comm: nfsd
EIP: 0060:[<c0179a3a>] CPU: 3
EIP is at __generic_file_splice_read+0x12c/0x418
EFLAGS: 00000206 Not tainted (2.6.23.17-fwsh-byte #3)
EAX: f6e9dddc EBX: 00001000 ECX: 00000001 EDX: 00000000
ESI: 00000000 EDI: f6e9dcd0 EBP: 00000095 DS: 007b ES: 007b FS: 00d8
CR0: 8005003b CR2: b7e72cc0 CR3: 00622000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c0113cbb>] __check_preempt_curr_fair+0x4b/0x7d
[<c0113dcb>] entity_tick+0x47/0x54
[<c013006b>] getnstimeofday+0x37/0x111
[<c0132fb6>] clockevents_program_event+0xac/0xcc
[<c0122996>] run_timer_softirq+0x30/0x184
[<c012f36f>] hrtimer_interrupt+0x132/0x1c4
[<c011f0e0>] __do_softirq+0xba/0xcf
[<c010da6a>] smp_apic_timer_interrupt+0x2c/0x35
[<c01032bc>] apic_timer_interrupt+0x28/0x30
[<c0179da7>] generic_file_splice_read+0x81/0xd5
[<c017a6b0>] do_splice_to+0x75/0x97
[<c017a771>] splice_direct_to_actor+0x9f/0x166
[<f8f2a494>] nfsd_acceptable+0x0/0xd1 [nfsd]
[<f8f2c247>] nfsd_direct_splice_actor+0x0/0xa [nfsd]
[<f8f2c5ea>] nfsd_vfs_read+0x399/0x3bd [nfsd]
[<c015d57f>] dentry_open+0x34/0x64
[<f8f2ca1d>] nfsd_read+0xee/0xfb [nfsd]
[<f8f332ab>] nfsd3_proc_read+0xfe/0x186 [nfsd]
[<f8f34cd9>] nfs3svc_decode_readargs+0x0/0xeb [nfsd]
[<f8f28847>] nfsd_dispatch+0xc5/0x1ca [nfsd]
[<c043ab82>] svcauth_unix_set_client+0x116/0x165
[<c0436b96>] svc_process+0x4fb/0x6d4
[<c01164ad>] default_wake_function+0x0/0xc
[<f8f2863d>] nfsd+0x16a/0x282 [nfsd]
[<f8f284d3>] nfsd+0x0/0x282 [nfsd]
[<c010343f>] kernel_thread_helper+0x7/0x10
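
For readers new to this trace format: each frame line carries an address, a symbol, the offset into that symbol, and the symbol's total size, so `__generic_file_splice_read+0x12c/0x418` means 0x12c bytes into a 0x418-byte function. A minimal user-space sketch of pulling those fields apart follows. `struct trace_frame`, `parse_trace_line` and `demo_parse` are hypothetical names made up for this illustration, not kernel or tool APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed frame line. */
struct trace_frame {
	unsigned long address;  /* value inside [<...>] */
	char symbol[128];       /* function name */
	unsigned long offset;   /* bytes into the function */
	unsigned long size;     /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the line is not a "[<addr>] sym+off/size" frame. */
static int parse_trace_line(const char *line, struct trace_frame *out)
{
	int fields = sscanf(line, " [<%lx>] %127[^+]+%lx/%lx",
			    &out->address, out->symbol,
			    &out->offset, &out->size);

	return fields == 4 ? 0 : -1;
}

/* Prints one frame from the trace above in expanded form. */
void demo_parse(void)
{
	struct trace_frame f;
	int rc = parse_trace_line(
		"[<c0179da7>] generic_file_splice_read+0x81/0xd5", &f);

	assert(rc == 0);
	(void)rc;
	printf("%s: 0x%lx of 0x%lx bytes (at 0x%lx)\n",
	       f.symbol, f.offset, f.size, f.address);
}
```

Register dump lines such as "EIP is at ..." do not match the frame pattern, so the parser rejects them rather than guessing.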



2008-03-01 16:39:47

by J. Bruce Fields

Subject: Re: Kernel 2.6.23.17 crash (Was: Kernel (2.6.24) crash on nfsd (BUG: soft lockup))

On Thu, Feb 28, 2008 at 11:56:51AM +0100, Allard Hoeve wrote:
> After trying 2.6.23.17, the same crash happened. The stack trace is a bit
> different, but the two are comparable.
>
> Is this an NFS problem in the first place? Where could we go for help
> with this problem?

Thanks for the reports!

So, the summary: several people are reporting soft lockup warnings with
__generic_file_splice_read as the latest or next-to-latest function on
the stack. Sounds like 2.6.18 is good, various kernels around 2.6.23
and 2.6.24 are reported bad. Is it possible this was a regression
introduced by the splice changes?

--b.


2008-03-01 17:03:28

by Jens Axboe

Subject: Re: Kernel 2.6.23.17 crash (Was: Kernel (2.6.24) crash on nfsd (BUG: soft lockup))

On Sat, Mar 01 2008, J. Bruce Fields wrote:
> On Thu, Feb 28, 2008 at 11:56:51AM +0100, Allard Hoeve wrote:
> > After trying 2.6.23.17, the same crash happened. The stack trace is a bit
> > different, but the two are comparable.
> >
> > Is this an NFS problem in the first place? Where could we go for help
> > with this problem?
>
> Thanks for the reports!
>
> So, the summary: several people are reporting soft lockup warnings with
> __generic_file_splice_read as the latest or next-to-latest function on
> the stack. Sounds like 2.6.18 is good, various kernels around 2.6.23
> and 2.6.24 are reported bad. Is it possible this was a regression
> introduced by the splice changes?

I posted this two days ago, but didn't get a reply. Has anyone who can
reproduce the problem tested it?

diff --git a/fs/splice.c b/fs/splice.c
index 9b559ee..0254ec6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -370,8 +370,10 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 * for an in-flight io page
 			 */
 			if (flags & SPLICE_F_NONBLOCK) {
-				if (TestSetPageLocked(page))
+				if (TestSetPageLocked(page)) {
+					error = -EAGAIN;
 					break;
+				}
 			} else
 				lock_page(page);

@@ -479,9 +481,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	ssize_t spliced;
-	int ret;
 	loff_t isize, left;
+	int ret;

 	isize = i_size_read(in->f_mapping->host);
 	if (unlikely(*ppos >= isize))
@@ -491,29 +492,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	if (unlikely(left < len))
 		len = left;

-	ret = 0;
-	spliced = 0;
-	while (len && !spliced) {
-		ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
-
-		if (ret < 0)
-			break;
-		else if (!ret) {
-			if (spliced)
-				break;
-			if (flags & SPLICE_F_NONBLOCK) {
-				ret = -EAGAIN;
-				break;
-			}
-		}
-
+	ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
+	if (ret > 0)
 		*ppos += ret;
-		len -= ret;
-		spliced += ret;
-	}
-
-	if (spliced)
-		return spliced;

 	return ret;
 }

--
Jens Axboe
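
One way to read the second half of the patch, sketched in user-space C: in the removed loop, a return of 0 from the inner call without SPLICE_F_NONBLOCK leaves both `len` and `spliced` unchanged, so the loop condition stays true and the call is retried. Whether that is what the reporters actually hit is not confirmed in this thread. Everything below is a model under assumptions: `stub_inner_read` stands in for `__generic_file_splice_read` and always returns 0, `MODEL_F_NONBLOCK` stands in for the real flag, and `MAX_ITERATIONS` is an artificial cap that the kernel loop does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

#define MODEL_F_NONBLOCK 0x02  /* hypothetical stand-in for SPLICE_F_NONBLOCK */
#define MAX_ITERATIONS   1000  /* artificial cap so the model terminates */

/* Stub for __generic_file_splice_read(): always reports "nothing spliced". */
static ssize_t stub_inner_read(size_t len, unsigned int flags)
{
	(void)len;
	(void)flags;
	return 0;
}

/* Pre-patch control flow; *iterations receives the number of loop passes. */
static ssize_t old_splice_read(size_t len, unsigned int flags, int *iterations)
{
	ssize_t spliced = 0, ret = 0;

	*iterations = 0;
	while (len && !spliced) {
		if (*iterations >= MAX_ITERATIONS)
			break;  /* model only: the removed loop had no such exit */
		++*iterations;

		ret = stub_inner_read(len, flags);
		if (ret < 0)
			break;
		else if (!ret) {
			if (spliced)
				break;
			if (flags & MODEL_F_NONBLOCK) {
				ret = -EAGAIN;
				break;
			}
		}
		len -= ret;
		spliced += ret;
	}
	return spliced ? spliced : ret;
}

/* Post-patch control flow: one call, result handed straight back. */
static ssize_t new_splice_read(size_t len, unsigned int flags)
{
	return stub_inner_read(len, flags);
}

/* Prints how many passes each variant makes with the stub. */
void demo_loop_model(void)
{
	int n;
	ssize_t ret = old_splice_read(4096, 0, &n);

	assert(n == MAX_ITERATIONS);
	printf("old, blocking: ret=%ld after %d passes (capped)\n", (long)ret, n);
	printf("new, blocking: ret=%ld after 1 call\n",
	       (long)new_splice_read(4096, 0));
}
```

With the stub, the blocking call to `old_splice_read` only stops because of the cap, the `MODEL_F_NONBLOCK` call leaves after one pass with -EAGAIN, and `new_splice_read` returns 0 to its caller after a single call.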


2008-03-05 10:25:36

by Gertjan Oude Lohuis

Subject: Re: Kernel 2.6.23.17 crash (Was: Kernel (2.6.24) crash on nfsd (BUG: soft lockup))

Hi Jens et al,

On 03/01/2008 06:03 PM, Jens Axboe wrote:
> On Sat, Mar 01 2008, J. Bruce Fields wrote:
>> So, the summary: several people are reporting soft lockup warnings with
>> __generic_file_splice_read as the latest or next-to-latest function on
>> the stack. Sounds like 2.6.18 is good, various kernels around 2.6.23
>> and 2.6.24 are reported bad. Is it possible this was a regression
>> introduced by the splice changes?
>
> I posted this two days ago, but didn't get a reply. Has anyone who can
> reproduce the problem tested it?
>
> diff --git a/fs/splice.c b/fs/splice.c

<snip patch>

I'm sorry we didn't respond earlier. We've been quite busy dividing
our data over multiple fileservers to lower the load on the primary
server, and in the process we downgraded the kernels on the NFS servers
to 2.6.22.19.
Since then we haven't seen another crash. My gut feeling is that the
downgraded kernels were the 'solution', but it could also be that the
lowered load has kept the servers from crashing.

At the moment we won't be able to test your patch, simply because we
can't afford any more crashes. However, if 2.6.22.19 does crash in the
same way in the near future, I'll try your patch.

Thanks for your interest and help!

Regards,
Gertjan Oude Lohuis