2017-06-27 11:59:50

by Paul Menzel

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

Dear Brian, dear Christoph,


Just a small update that we were hit by the problem on a different
machine (identical model) with Linux 4.9.32 and the exact same symptoms.

```
$ sudo cat /proc/2085/stack
[<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
[<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
[<ffffffff811f9a82>] iomap_apply+0xa2/0x110
[<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
[<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
[<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
[<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
[<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
[<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
[<ffffffff81199fba>] do_readv_writev+0x18a/0x230
[<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
[<ffffffffffffffff>] 0xffffffffffffffff
```

We haven’t had time to set up a test system yet to analyze that further.


Kind regards,

Paul


Attachments:
20170627–linux_4.9.32–nfsd_locking.txt (51.61 kB)

2017-06-28 16:41:39

by Christoph Hellwig

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

On Tue, Jun 27, 2017 at 01:59:46PM +0200, Paul Menzel wrote:
> Dear Brian, dear Christoph,
>
>
> Just a small update that we were hit by the problem on a different machine
> (identical model) with Linux 4.9.32 and the exact same symptoms.

Thanks for the update. I have to admit this issue completely dropped
off my radar, but I'll try to get back to it in the next few days.

2017-08-01 17:49:52

by Paul Menzel

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

Dear Brian, dear Christoph,


On 06/27/17 13:59, Paul Menzel wrote:

> Just a small update that we were hit by the problem on a different
> machine (identical model) with Linux 4.9.32 and the exact same symptoms.
>
> ```
> $ sudo cat /proc/2085/stack
> [<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
> [<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
> [<ffffffff811f9a82>] iomap_apply+0xa2/0x110
> [<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
> [<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
> [<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
> [<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
> [<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
> [<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
> [<ffffffff81199fba>] do_readv_writev+0x18a/0x230
> [<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> ```
>
> We haven’t had time to set up a test system yet to analyze that further.

Today, two systems with Linux 4.9.23 exhibited the problem of `top`
showing `nfsd` at 100 %. Restarting one machine into Linux *4.9.38*
showed the same problem. One of them, with a 1 GBit/s network device,
received traffic from a 10 GBit/s system, so the connection was saturated.

But looking at the current processes, the process with ID 829 does not
show that CPU usage value. (I don’t remember what the behavior was
during the other incidents on other machines.)

This is from 4.9.38.

The stack trace of the process with ID 857 is below.

```
$ sudo cat /proc/857/stack
[<ffffffff811f94fc>] iomap_write_begin+0x6c/0x120
[<ffffffffffffffff>] 0xffffffffffffffff
```

Then I executed Brian’s commands.

```
$ trace-cmd start -p function -l iomap* -P <pid>
$ cat /sys/kernel/debug/tracing/trace_pipe
```

This produced the following output.

```
nfsd-857 [005] .... 426.650413: iomap_write_end <-iomap_zero_range_actor
nfsd-857 [005] .... 426.650414: iomap_write_begin <-iomap_zero_range_actor
CPU:5 [LOST 115621695 EVENTS]
nfsd-857 [005] .... 453.070439: iomap_write_begin <-iomap_zero_range_actor
nfsd-857 [005] .... 453.070440: iomap_write_end <-iomap_zero_range_actor
```

This continues endlessly.


Kind regards,

Paul

2017-08-01 22:57:51

by Dave Chinner

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

On Tue, Aug 01, 2017 at 07:49:50PM +0200, Paul Menzel wrote:
> Dear Brian, dear Christoph,
>
>
> On 06/27/17 13:59, Paul Menzel wrote:
>
> >Just a small update that we were hit by the problem on a different
> >machine (identical model) with Linux 4.9.32 and the exact same
> >symptoms.
> >
> >```
> >$ sudo cat /proc/2085/stack
> >[<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
> >[<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
> >[<ffffffff811f9a82>] iomap_apply+0xa2/0x110
> >[<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
> >[<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
> >[<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
> >[<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
> >[<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
> >[<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
> >[<ffffffff81199fba>] do_readv_writev+0x18a/0x230
> >[<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
> >[<ffffffffffffffff>] 0xffffffffffffffff
> >```
> >
> >We haven’t had time to set up a test system yet to analyze that further.
>
> Today, two systems with Linux 4.9.23 exhibited the problem of `top`
> showing that `nfsd` is at 100 %. Restarting one machine into Linux
> *4.9.38* showed the same problem. One of them with a 1 GBit/s
> network device got traffic from a 10 GBit/s system, so the
> connection was saturated.

So the question is this: is there IO being issued here, is the page
cache growing, or is it in a tight loop doing nothing? Details of
your hardware, XFS config, and NFS server config are kinda important
here, too.

For example, if the NFS server IO patterns trigger a large
speculative delayed allocation and the client then does a write at the
end of the speculative delalloc range, we will zero the entire
speculative delalloc range. That could be several GB of zeros that
need to be written here. It's sub-optimal, yes, but large
zeroing is rare enough that we haven't needed to optimise it by
allocating unwritten extents instead. It would be really handy to
know what application the NFS client is running, as that might give
insight into the trigger behaviour and whether you are hitting this
case.

Also, if the NFS client is only writing to one file, then all the
other writes that are on the wire will end up being serviced by nfsd
threads that then block waiting for the inode lock. If the client
issues more writes on the wire than the NFS server has worker
threads, the client side write will starve the NFS server of
worker threads until the zeroing completes. This is the behaviour
you are seeing - it's a common server side config error that's been
known for at least 15 years...

FWIW, it used to be that a Linux NFS client could have 16 concurrent
outstanding NFS RPCs to a server at a time - I don't know if that
limit still exists or whether it's been increased. However, the
typical knfsd default is (still) only 8 worker threads, meaning a
single client and server using default configs can cause the above
server DOS issue. E.g., on a bleeding-edge Debian distro install:

$ head -2 /etc/default/nfs-kernel-server
# Number of servers to start up
RPCNFSDCOUNT=8
$

So, yeah, distros still only configure the NFS server with 8 worker
threads by default. If it's a dedicated NFS server, then I'd be using
somewhere around 64 NFSD threads *per CPU* as a starting point for
the server config...

At minimum, you need to ensure that the NFS server has at least
double the number of server threads as the largest client side
concurrent RPC count so that a single client can't DOS the NFS
server with a single blocked write stream.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-08-10 14:11:37

by Paul Menzel

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

Dear Dave,


On 08/02/17 00:51, Dave Chinner wrote:
> On Tue, Aug 01, 2017 at 07:49:50PM +0200, Paul Menzel wrote:

>> On 06/27/17 13:59, Paul Menzel wrote:
>>
>>> Just a small update that we were hit by the problem on a different
>>> machine (identical model) with Linux 4.9.32 and the exact same
>>> symptoms.
>>>
>>> ```
>>> $ sudo cat /proc/2085/stack
>>> [<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
>>> [<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
>>> [<ffffffff811f9a82>] iomap_apply+0xa2/0x110
>>> [<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
>>> [<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
>>> [<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
>>> [<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
>>> [<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
>>> [<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
>>> [<ffffffff81199fba>] do_readv_writev+0x18a/0x230
>>> [<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>> ```
>>>
>>> We haven’t had time to set up a test system yet to analyze that further.
>>
>> Today, two systems with Linux 4.9.23 exhibited the problem of `top`
>> showing that `nfsd` is at 100 %. Restarting one machine into Linux
>> *4.9.38* showed the same problem. One of them with a 1 GBit/s
>> network device got traffic from a 10 GBit/s system, so the
>> connection was saturated.
>
> So the question is this: is there IO being issued here, is the page
> cache growing, or is it in a tight loop doing nothing? Details of
> your hardware, XFS config and NFS server config is kinda important
> here, too.

Could you please guide me on where I can get the information you request?

The hardware ranges from slow 12-thread systems with 96 GB of RAM to
80-thread machines with 1 TB of RAM. Often, big files (up to 100 GB) are written.

> For example, if the NFS server IO patterns trigger a large
> speculative delayed allocation, then the client does a write at the
> end of the speculative delalloc range, we will zero the entire
> speculative delalloc range. That could be several GB of zeros that
> need to be written here. It's sub-optimal, yes, but large
> zeroing is rare enough that we haven't needed to optimise it by
> allocating unwritten extents instead. It would be really handy to
> know what application the NFS client is running as that might give
> insight into the trigger behaviour and whether you are hitting this
> case.

It ranges from simple `cp` to scripts writing FASTQ files with
biological sequences in them.

> Also, if the NFS client is only writing to one file, then all the
> other writes that are on the wire will end up being serviced by nfsd
> threads that then block waiting for the inode lock. If the client
> issues more writes on the wire than the NFS server has worker
> threads, the client side write will starve the NFS server of
> worker threads until the zeroing completes. This is the behaviour
> you are seeing - it's a common server side config error that's been
> known for at least 15 years...
>
> FWIW, it used to be that a linux NFS client could have 16 concurrent
> outstanding NFS RPCs to a server at a time - I don't know if that
> limit still exists or whether it's been increased. However, the
> typical knfsd default is (still) only 8 worker threads, meaning a
> single client and server using default configs can cause the above
> server DOS issue. e.g on a bleeding edge debian distro install:
>
> $ head -2 /etc/default/nfs-kernel-server
> # Number of servers to start up
> RPCNFSDCOUNT=8
> $
>
> So, yeah, distros still only configure the NFS server with 8 worker
> threads by default. If it's a dedicated NFS server, then I'd be using
> somewhere around 64 NFSD threads *per CPU* as a starting point for
> the server config...
>
> At minimum, you need to ensure that the NFS server has at least
> double the number of server threads as the largest client side
> concurrent RPC count so that a single client can't DOS the NFS
> server with a single blocked write stream.

That’s not the issue here. The NFS server is started with 64 threads
here. Also, this doesn’t explain why it works with the 4.4 series.

The directory cannot be accessed at all. `ls /mounted/path` just hangs
on remote systems.


Kind regards,

Paul

2017-08-10 20:00:04

by Markus Stockhausen

Subject: AW: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

Hi Paul et al,

> Dear Dave,
>
>
> On 08/02/17 00:51, Dave Chinner wrote:
> > On Tue, Aug 01, 2017 at 07:49:50PM +0200, Paul Menzel wrote:
>
> >> On 06/27/17 13:59, Paul Menzel wrote:
> >>
> >>> Just a small update that we were hit by the problem on a different
> >>> machine (identical model) with Linux 4.9.32 and the exact same
> >>> symptoms.
> >>>
> >>> ```
> >>> $ sudo cat /proc/2085/stack
> >>> [<ffffffff811f920c>] iomap_write_begin+0x8c/0x120
> >>> [<ffffffff811f982b>] iomap_zero_range_actor+0xeb/0x210
> >>> [<ffffffff811f9a82>] iomap_apply+0xa2/0x110
> >>> [<ffffffff811f9c58>] iomap_zero_range+0x58/0x80
> >>> [<ffffffff8133c7de>] xfs_zero_eof+0x4e/0xb0
> >>> [<ffffffff8133c9dd>] xfs_file_aio_write_checks+0x19d/0x1c0
> >>> [<ffffffff8133ce89>] xfs_file_buffered_aio_write+0x79/0x2d0
> >>> [<ffffffff8133d17e>] xfs_file_write_iter+0x9e/0x150
> >>> [<ffffffff81198dc0>] do_iter_readv_writev+0xa0/0xf0
> >>> [<ffffffff81199fba>] do_readv_writev+0x18a/0x230
> >>> [<ffffffff8119a2ac>] vfs_writev+0x3c/0x50
> >>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>> ```
> >>>
> >>> We haven’t had time to set up a test system yet to analyze that further.
> >>
> >> Today, two systems with Linux 4.9.23 exhibited the problem of `top`
> >> showing that `nfsd` is at 100 %. Restarting one machine into Linux
> >> *4.9.38* showed the same problem. One of them with a 1 GBit/s network
> >> device got traffic from a 10 GBit/s system, so the connection was
> >> saturated.
> >
> > So the question is this: is there IO being issued here, is the page
> > cache growing, or is it in a tight loop doing nothing? Details of your
> > hardware, XFS config and NFS server config is kinda important here,
> > too.

Looking at the 64-bit input for the variable `count` of

iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
		void *data, struct iomap *iomap)

and the 32-bit internal variables

unsigned offset, bytes;

the culprit could be

bytes = min_t(unsigned, PAGE_SIZE - offset, count);

Let's say you are trying to zero a multiple of 4 GB. With `bytes`
evaluating to 0, this will hit an endless loop within that iomap
function. That might explain your observation. If that is right, a bugfix
would qualify for stable 4.8+.
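
To illustrate the suspected truncation, here is a minimal userspace
sketch (the `min_t`-like macro and the constants are simplified
stand-ins, not the kernel's code) showing how a remaining count that is
an exact multiple of 4 GiB collapses to 0 once it is forced into a
32-bit `unsigned`:

```
#include <stdio.h>

/* Simplified stand-in for the kernel's min_t(): both values are cast to
 * the requested type before the comparison. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	unsigned long page_size = 4096;
	long long count = 4LL * 1024 * 1024 * 1024;	/* 4 GiB left to zero */
	unsigned offset = 0;				/* page-aligned position */
	unsigned bytes;

	/* (unsigned)count drops the high 32 bits and becomes 0, so the
	 * minimum is 0: the zeroing loop would advance by 0 bytes per
	 * iteration and never make progress. */
	bytes = min_t(unsigned, page_size - offset, count);
	printf("bytes = %u\n", bytes);	/* prints: bytes = 0 */

	return 0;
}
```

With `bytes` stuck at 0, `pos` and `count` never change, which would
match the endless `iomap_write_begin`/`iomap_write_end` pairs in the
trace above.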

Best regards.

Markus

> Could you please guide me, where I can get the information you request?
>
> The hardware ranges from slow 12 thread systems with 96 GB RAM to 80 thread 1 TB RAM machines. Often big files (up to 100 GB) are written.
>
> > For example, if the NFS server IO patterns trigger a large speculative
> > delayed allocation, then the client does a write at the end of the
> > speculative delalloc range, we will zero the entire speculative
> > delalloc range. That could be several GB of zeros that need to be
> > written here. It's sub-optimal, yes, but large zeroing is rare
> > enough that we haven't needed to optimise it by allocating unwritten
> > extents instead. It would be really handy to know what application
> > the NFS client is running as that might give insight into the trigger
> > behaviour and whether you are hitting this case.
>
> It ranges from simple `cp` to scripts writing FASTQ files with biological sequences in it.
>
> > Also, if the NFS client is only writing to one file, then all the
> > other writes that are on the wire will end up being serviced by nfsd
> > threads that then block waiting for the inode lock. If the client
> > issues more writes on the wire than the NFS server has worker threads,
> > the client side write will starve the NFS server of worker threads
> > until the zeroing completes. This is the behaviour you are seeing -
> > it's a common server side config error that's been known for at least
> > 15 years...
> >
> > FWIW, it used to be that a linux NFS client could have 16 concurrent
> > outstanding NFS RPCs to a server at a time - I don't know if that
> > limit still exists or whether it's been increased. However, the
> > typical knfsd default is (still) only 8 worker threads, meaning a
> > single client and server using default configs can cause the above
> > server DOS issue. e.g on a bleeding edge debian distro install:
> >
> > $ head -2 /etc/default/nfs-kernel-server
> > # Number of servers to start up
> > RPCNFSDCOUNT=8
> > $
> >
> > So, yeah, distros still only configure the NFS server with 8 worker
> > threads by default. If it's a dedicated NFS server, then I'd be using
> > somewhere around 64 NFSD threads *per CPU* as a starting point for the
> > server config...
> >
> > At minimum, you need to ensure that the NFS server has at least double
> > the number of server threads as the largest client side concurrent RPC
> > count so that a single client can't DOS the NFS server with a single
> > blocked write stream.
>
> That’s not the issue here. The NFS server is started with 64 threads here. Also, this doesn’t explain why it works with the 4.4 series.
>
> The directory cannot be accessed at all. `ls /mounted/path` just hangs on remote systems.
>
>
> Kind regards,
>
> Paul



2017-08-11 10:15:15

by Christoph Hellwig

Subject: Re: AW: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

On Thu, Aug 10, 2017 at 07:54:51PM +0000, Markus Stockhausen wrote:
> Let's say you are trying to zero a multiple of 4 GB. With bytes
> evaluating to 0, this will hit an endless loop within that iomap
> function. That might explain your observation. If that is right, a bugfix
> would qualify for stable 4.8+.

Yes, it seems like min_t casts arguments 2 and 3 to the type in argument
1, which could lead to incorrect truncation.

Paul, please try the patch below:

diff --git a/fs/iomap.c b/fs/iomap.c
index 039266128b7f..59cc98ad7577 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -278,7 +278,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		unsigned long bytes;	/* Bytes to write to page */
 
 		offset = (pos & (PAGE_SIZE - 1));
-		bytes = min_t(unsigned long, PAGE_SIZE - offset, length);
+		bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 
 		rpage = __iomap_read_page(inode, pos);
 		if (IS_ERR(rpage))
@@ -373,7 +373,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		unsigned offset, bytes;
 
 		offset = pos & (PAGE_SIZE - 1); /* Within page */
-		bytes = min_t(unsigned, PAGE_SIZE - offset, count);
+		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
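
As a rough before/after check of the change (again a userspace sketch
with a simplified stand-in for `min_t`, not the kernel build itself),
doing the comparison in a 64-bit type keeps the per-page chunk at the
page size as intended:

```
#include <stdio.h>

#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	long long count = 4LL * 1024 * 1024 * 1024;	/* 4 GiB left to zero */
	unsigned offset = 0;

	/* Old: the count is truncated to 32 bits first and becomes 0. */
	unsigned old_bytes = min_t(unsigned, 4096 - offset, count);

	/* New: comparing in a 64-bit type (loff_t is a signed 64-bit
	 * type) keeps the full count, so the chunk is clamped to the
	 * page size. */
	unsigned new_bytes = min_t(long long, 4096 - offset, count);

	printf("old: %u, new: %u\n", old_bytes, new_bytes);
	/* prints: old: 0, new: 4096 */
	return 0;
}
```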

2017-08-11 15:14:17

by Paul Menzel

Subject: Re: Locking problems with Linux 4.9 and 4.11 with NFSD and `fs/iomap.c`

Dear Christoph, dear Markus,


On 08/11/17 12:15, Christoph Hellwig wrote:
> On Thu, Aug 10, 2017 at 07:54:51PM +0000, Markus Stockhausen wrote:
>> Let's say you are trying to zero a multiple of 4 GB. With bytes
>> evaluating to 0, this will hit an endless loop within that iomap
>> function. That might explain your observation. If that is right, a bugfix
>> would qualify for stable 4.8+.
>
> Yes, it seems like min_t casts arguments 2 and 3 to the type in argument
> 1, which could lead to incorrect truncation.
>
> Paul, please try the patch below:
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 039266128b7f..59cc98ad7577 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -278,7 +278,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> unsigned long bytes; /* Bytes to write to page */
>
> offset = (pos & (PAGE_SIZE - 1));
> - bytes = min_t(unsigned long, PAGE_SIZE - offset, length);
> + bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>
> rpage = __iomap_read_page(inode, pos);
> if (IS_ERR(rpage))
> @@ -373,7 +373,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> unsigned offset, bytes;
>
> offset = pos & (PAGE_SIZE - 1); /* Within page */
> - bytes = min_t(unsigned, PAGE_SIZE - offset, count);
> + bytes = min_t(loff_t, PAGE_SIZE - offset, count);
>
> if (IS_DAX(inode))
> status = iomap_dax_zero(pos, offset, bytes, iomap);

I applied this on top of Linux 4.9.41. Even when writing 40 files of
100 GB each in parallel to an NFS-exported directory from different
systems, and keeping the load on the system high, the NFS-exported
directory was always accessible from all hosts.

Markus, thank you very much for looking into this and spotting the bug.
A big thank you to all the people helping to analyze this issue.

I’ll keep you posted, but for now:

Tested-by: Paul Menzel <[email protected]>


Kind regards,

Paul