2021-08-05 03:59:51

by Timothy Pearson

Subject: Callback slot table overflowed

All,

We've hit an odd issue after upgrading a main NFS server from Debian Stretch to Debian Buster. The 5.13.4 kernel was used in both cases; however, after the upgrade none of our ARM thin clients can mount their root filesystems. Early in the boot process, I/O errors are returned immediately following "Callback slot table overflowed" in the client dmesg.

I am unable to find any useful information on this "Callback slot table overflowed" message, and have no idea why it is only impacting our ARM (armel) clients. Both 4.14 and 5.3 on the client side show the issue; other client kernel versions were not tested.

Curiously, increasing the rsize/wsize values to 65536 or higher reduces (but does not eliminate) the number of callback overflow messages.
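
To compare runs with different rsize/wsize values, one option is to count the overflow messages in a saved copy of the client dmesg. The following is only a minimal sketch: the function name `count_overflows` is hypothetical, and the sample log lines are made up for illustration rather than taken from the affected clients.

```python
def count_overflows(dmesg_text, needle="Callback slot table overflowed"):
    """Count kernel log lines that contain the overflow message."""
    return sum(1 for line in dmesg_text.splitlines() if needle in line)


# Made-up log excerpt for illustration only; not real client output.
sample = """\
[   12.345678] Callback slot table overflowed
[   12.345900] some unrelated kernel message
[   12.400000] Callback slot table overflowed
"""

print(count_overflows(sample))
```

Running this once per rsize/wsize setting against the captured log would give a number to compare, rather than an impression from scrolling through dmesg.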

The server is a ppc64el host with 64k pages, and none of our ppc64el or amd64 thin clients are experiencing any problems. Nothing of interest appears in the server message log.

Any troubleshooting hints would be most welcome.

Thank you!


2021-08-05 03:59:51

by Timothy Pearson

Subject: Re: Callback slot table overflowed

Other information that may be helpful:

All clients are using TCP
arm64 clients are unaffected by the bug
The armel clients use very small (4k) rsize/wsize buffers
Prior to the upgrade from Debian Stretch, everything was working perfectly

----- Original Message -----
> From: "Timothy Pearson" <[email protected]>
> To: "linux-nfs" <[email protected]>
> Sent: Wednesday, August 4, 2021 7:00:20 PM
> Subject: Callback slot table overflowed

> All,
>
> We've hit an odd issue after upgrading a main NFS server from Debian Stretch to
> Debian Buster. In both cases the 5.13.4 kernel was used, however after the
> upgrade none of our ARM thin clients can mount their root filesystems -- early
> in the boot process I/O errors are returned immediately following "Callback
> slot table overflowed" in the client dmesg.
>
> I am unable to find any useful information on this "Callback slot table
> overflowed" message, and have no idea why it is only impacting our ARM (armel)
> clients. Both 4.14 and 5.3 on the client side show the issue, other client
> kernel versions were not tested.
>
> Curiously, increasing the rsize/wsize values to 65536 or higher reduces (but
> does not eliminate) the number of callback overflow messages.
>
> The server is a ppc64el 64k page host, and none of our ppc64el or amd64 thin
> clients are experiencing any problems. Nothing of interest appears in the
> server message log.
>
> Any troubleshooting hints would be most welcome.
>
> Thank you!

2021-08-05 04:15:47

by Timothy Pearson

Subject: Re: Callback slot table overflowed

On further investigation, the working server had already been rolled back to 4.19.0. Apparently the issue was insurmountable in 5.x.

It should be simple enough to set up a test environment out of production for 5.x, if you have any debug tips / would like to see any debug options compiled in.

Thanks!

----- Original Message -----
> From: "Timothy Pearson" <[email protected]>
> To: "linux-nfs" <[email protected]>
> Sent: Wednesday, August 4, 2021 7:04:16 PM
> Subject: Re: Callback slot table overflowed

> Other information that may be helpful:
>
> All clients are using TCP
> arm64 clients are unaffected by the bug
> The armel clients use very small (4k) rsize/wsize buffers
> Prior to the upgrade from Debian Stretch, everything was working perfectly
>
> ----- Original Message -----
>> From: "Timothy Pearson" <[email protected]>
>> To: "linux-nfs" <[email protected]>
>> Sent: Wednesday, August 4, 2021 7:00:20 PM
>> Subject: Callback slot table overflowed
>
>> All,
>>
>> We've hit an odd issue after upgrading a main NFS server from Debian Stretch to
>> Debian Buster. In both cases the 5.13.4 kernel was used, however after the
>> upgrade none of our ARM thin clients can mount their root filesystems -- early
>> in the boot process I/O errors are returned immediately following "Callback
>> slot table overflowed" in the client dmesg.
>>
>> I am unable to find any useful information on this "Callback slot table
>> overflowed" message, and have no idea why it is only impacting our ARM (armel)
>> clients. Both 4.14 and 5.3 on the client side show the issue, other client
>> kernel versions were not tested.
>>
>> Curiously, increasing the rsize/wsize values to 65536 or higher reduces (but
>> does not eliminate) the number of callback overflow messages.
>>
>> The server is a ppc64el 64k page host, and none of our ppc64el or amd64 thin
>> clients are experiencing any problems. Nothing of interest appears in the
>> server message log.
>>
>> Any troubleshooting hints would be most welcome.
>>
>> Thank you!

2021-08-07 00:09:01

by Olga Kornievskaia

Subject: Re: Callback slot table overflowed

On Thu, Aug 5, 2021 at 12:15 AM Timothy Pearson
<[email protected]> wrote:
>
> On further investigation, the working server had already been rolled back to 4.19.0. Apparently the issue was insurmountable in 5.x.
>
> It should be simple enough to set up a test environment out of production for 5.x, if you have any debug tips / would like to see any debug options compiled in.
>
> Thanks!
>
> ----- Original Message -----
> > From: "Timothy Pearson" <[email protected]>
> > To: "linux-nfs" <[email protected]>
> > Sent: Wednesday, August 4, 2021 7:04:16 PM
> > Subject: Re: Callback slot table overflowed
>
> > Other information that may be helpful:
> >
> > All clients are using TCP
> > arm64 clients are unaffected by the bug
> > The armel clients use very small (4k) rsize/wsize buffers
> > Prior to the upgrade from Debian Stretch, everything was working perfectly
> >
> > ----- Original Message -----
> >> From: "Timothy Pearson" <[email protected]>
> >> To: "linux-nfs" <[email protected]>
> >> Sent: Wednesday, August 4, 2021 7:00:20 PM
> >> Subject: Callback slot table overflowed
> >
> >> All,
> >>
> >> We've hit an odd issue after upgrading a main NFS server from Debian Stretch to
> >> Debian Buster. In both cases the 5.13.4 kernel was used, however after the
> >> upgrade none of our ARM thin clients can mount their root filesystems -- early
> >> in the boot process I/O errors are returned immediately following "Callback
> >> slot table overflowed" in the client dmesg.
> >>
> >> I am unable to find any useful information on this "Callback slot table
> >> overflowed" message, and have no idea why it is only impacting our ARM (armel)
> >> clients. Both 4.14 and 5.3 on the client side show the issue, other client
> >> kernel versions were not tested.
> >>
> >> Curiously, increasing the rsize/wsize values to 65536 or higher reduces (but
> >> does not eliminate) the number of callback overflow messages.
> >>
> >> The server is a ppc64el 64k page host, and none of our ppc64el or amd64 thin
> >> clients are experiencing any problems. Nothing of interest appears in the
> >> server message log.
> >>
> >> Any troubleshooting hints would be most welcome.

A network trace would be useful.

5.3 should already have the patch "SUNRPC: Fix up backchannel slot table
accounting". I believe "Callback slot table overflowed" is hit when
the server sends more requests than the client can handle (i.e. the client
doesn't have a free slot for the request). A network trace would show that.
However, you said this happens while the client is trying to mount, and
besides cb_null requests I'm not sure what could be happening.
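
As a rough illustration of the condition described above, here is a toy model of a fixed-size callback slot table. It is a sketch under stated assumptions, not the kernel's SUNRPC code: the class `BackchannelSlotTable`, its methods, and the slot count are all hypothetical and chosen only to show how an overflow arises when a request arrives while every slot is busy.

```python
class BackchannelSlotTable:
    """Toy model of a client-side callback slot table (not kernel code)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots  # hypothetical fixed capacity
        self.in_use = set()         # XIDs of callbacks being processed
        self.overflows = 0          # requests rejected for lack of a slot

    def receive_callback(self, xid):
        """Try to claim a slot for an incoming callback request."""
        if len(self.in_use) >= self.num_slots:
            # No free slot: this is the "overflowed" condition.
            self.overflows += 1
            return False
        self.in_use.add(xid)
        return True

    def complete_callback(self, xid):
        """Release the slot once the callback has been answered."""
        self.in_use.discard(xid)


table = BackchannelSlotTable(num_slots=1)
first = table.receive_callback(xid=1)   # takes the only slot
second = table.receive_callback(xid=2)  # arrives before xid=1 completes
print(first, second, table.overflows)
```

In this model the second request is refused only because the first has not yet been completed, which is the kind of ordering a network trace should make visible.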

> >>
> >> Thank you!

2021-08-07 00:12:21

by Timothy Pearson

Subject: Re: Callback slot table overflowed



----- Original Message -----
> From: "Olga Kornievskaia" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "linux-nfs" <[email protected]>
> Sent: Friday, August 6, 2021 2:53:19 PM
> Subject: Re: Callback slot table overflowed

> On Thu, Aug 5, 2021 at 12:15 AM Timothy Pearson
> <[email protected]> wrote:
>>
>> On further investigation, the working server had already been rolled back to
>> 4.19.0. Apparently the issue was insurmountable in 5.x.
>>
>> It should be simple enough to set up a test environment out of production for
>> 5.x, if you have any debug tips / would like to see any debug options compiled
>> in.
>>
>> Thanks!
>>
>> ----- Original Message -----
>> > From: "Timothy Pearson" <[email protected]>
>> > To: "linux-nfs" <[email protected]>
>> > Sent: Wednesday, August 4, 2021 7:04:16 PM
>> > Subject: Re: Callback slot table overflowed
>>
>> > Other information that may be helpful:
>> >
>> > All clients are using TCP
>> > arm64 clients are unaffected by the bug
>> > The armel clients use very small (4k) rsize/wsize buffers
>> > Prior to the upgrade from Debian Stretch, everything was working perfectly
>> >
>> > ----- Original Message -----
>> >> From: "Timothy Pearson" <[email protected]>
>> >> To: "linux-nfs" <[email protected]>
>> >> Sent: Wednesday, August 4, 2021 7:00:20 PM
>> >> Subject: Callback slot table overflowed
>> >
>> >> All,
>> >>
>> >> We've hit an odd issue after upgrading a main NFS server from Debian Stretch to
>> >> Debian Buster. In both cases the 5.13.4 kernel was used, however after the
>> >> upgrade none of our ARM thin clients can mount their root filesystems -- early
>> >> in the boot process I/O errors are returned immediately following "Callback
>> >> slot table overflowed" in the client dmesg.
>> >>
>> >> I am unable to find any useful information on this "Callback slot table
>> >> overflowed" message, and have no idea why it is only impacting our ARM (armel)
>> >> clients. Both 4.14 and 5.3 on the client side show the issue, other client
>> >> kernel versions were not tested.
>> >>
>> >> Curiously, increasing the rsize/wsize values to 65536 or higher reduces (but
>> >> does not eliminate) the number of callback overflow messages.
>> >>
>> >> The server is a ppc64el 64k page host, and none of our ppc64el or amd64 thin
>> >> clients are experiencing any problems. Nothing of interest appears in the
>> >> server message log.
>> >>
>> >> Any troubleshooting hints would be most welcome.
>
> A network trace would be useful.
>
> 5.3 should have this patch "SUNRPC: Fix up backchannel slot table
> accounting". I believe "callback slot table overflowed" is hit when
> the server sent more reqs than client can handle (ie doesn't have a
> free slot to handle the request). A network trace would show that.
> However you said this happens when the client is trying to mount and
> besides cb_null requests I'm not sure what could be happening.

I'll work to get a network trace out of the test environment once it's set up. I should however clarify that this is immediately *after* mount, when the diskless ARM device is attempting to run early startup (i.e. reading /etc/init.d and such).

>> >>
>> >> Thank you!