2016-05-24 17:15:39

by Gernot Hillier

[permalink] [raw]
Subject: unexpected sync delays in dpkg for small pre-allocated files on ext4

Hi there!

We experience strange delays with kernel 4.1.18 during dpkg package
installation on an ext4 filesystem after switching from Ubuntu 14.04 to
16.04. We can reproduce the issue with kernel 4.6. Installation of the
same package takes 2s with ext3 and 31s with ext4 on the same partition.

Hardware is an Intel-based server with Supermicro X8DTH board and
Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
0x1000:0x0072, mpt2sas driver).

We could track this down to the introduction of fallocate() in recent
dpkg versions and derived the following synthetic test case. First
sync_file_range() call takes ~5ms, 2nd call ~15ms.

fd = open("test1.txt", 0xc1);
ret = fallocate(fd, 0, 0, 20);
ret = write(fd, "hallo", 6);
ret = sync_file_range(fd, 0, 0, 2);
ret = close(fd);

fd = open("test2.txt", 0xc1);
ret = fallocate(fd, 0, 0, 20);
ret = write(fd, "hallo", 6);
ret = sync_file_range(fd, 0, 0, 2);
ret = close(fd);

The delay is caused by sb_issue_zeroout() in ext4_ext_zeroout(), called
by ext4_ext_convert_to_initialized().

Ftrace function_graph snippet with important events (recorded with
kernel 4.6):

7) | ext4_ext_map_blocks() {
[...]
7) | /* ext4_ext_handle_unwritten_extents: ... */
7) | /* ext4_ext_convert_to_initialized_enter: ... */
7) | ext4_ext_zeroout() {
7) | ext4_issue_zeroout() {

[... CPU switches to kworker, then to <idle> ...]
[... 14 ms later, we continue...]

3) | /* irq_handler_entry: irq=25 name=mpt2sas0-msix0 */
3) | /* irq_handler_exit: irq=25 ret=handled */
7) | /* softirq_raise: vec=4 [action=BLOCK] */
7) | /* softirq_entry: vec=4 [action=BLOCK] */
7) | /* block_rq_complete: 8,16 W () 114969912 + 8 [0] */
[...]
7) * 14 ms | } /* ext4_issue_zeroout */

ANY of the following changes reduce the sync_file_range() runtime to <<
1 ms:

* do not use fallocate()
* fallocate() for more than extent_max_zeroout_kb, e.g. 40 kB
* echo 0 > /sys/fs/ext4/sdb2/extent_max_zeroout_kb
* use ext3 instead of ext4
* mount ext4 with -o dioread_nolock

To me, the problem looks comparable to
https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
a full hang and there's no RAID involved for us), so a closer look on
the SCSI layer or driver might be the next step?

Any suggestions how to continue analysis are greatly appreciated!

--
Gernot Hillier, Siemens AG
Corporate Technology, Corporate Competence Center Embedded Linux



2016-05-26 00:44:52

by Theodore Ts'o

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
>
> To me, the problem looks comparable to
> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
> a full hang and there's no RAID involved for us), so a closer look on
> the SCSI layer or driver might be the next step?

What I would suggest is to create a small test case which compares the
time it takes to allocate 1 megabyte of memory, zero it, and then
write one megabytes of zeros using the write(2) system call. Then try
writing one megabytes of zero using the BLKZEROOUT ioctl.

I'm pretty sure you'll see same issue with BLKZEROOUT ioctl, but at
this point, we'll be able to send bug reports to the SCSI and block
layer developers with something that makes this very clear that it has
nothing to do with ext4.

This way we can also do some differential diagnosis; given that I'm
not seeing this complaint from most people, I suspect it will be a
matter of adding some specific devices to a blacklist (so that even
though the SCSI device claims to support WRITE SAME, we need to
disable it because it has a really lousy implementation of the SCSI
WRITE SAME command).

Cheers,

- Ted



2016-05-26 02:20:24

by Dave Chinner

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
> Hi there!
>
> We experience strange delays with kernel 4.1.18 during dpkg package
> installation on an ext4 filesystem after switching from Ubuntu 14.04 to
> 16.04. We can reproduce the issue with kernel 4.6. Installation of the
> same package takes 2s with ext3 and 31s with ext4 on the same partition.
>
> Hardware is an Intel-based server with Supermicro X8DTH board and
> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
> 0x1000:0x0072, mpt2sas driver).
>
> We could track this down to the introduction of fallocate() in recent
> dpkg versions and derived the following synthetic test case. First
> sync_file_range() call takes ~5ms, 2nd call ~15ms.
>
> fd = open("test1.txt", 0xc1);
> ret = fallocate(fd, 0, 0, 20);
> ret = write(fd, "hallo", 6);
> ret = sync_file_range(fd, 0, 0, 2);
> ret = close(fd);
>
> fd = open("test2.txt", 0xc1);
> ret = fallocate(fd, 0, 0, 20);
> ret = write(fd, "hallo", 6);
> ret = sync_file_range(fd, 0, 0, 2);
> ret = close(fd);

Stupid question: why is dpkg using fallocate() for such small ranges
like that? I can't think of a more inefficient way to do small IO -
using delayed allocation is far more optimal from a layout,
overhead, latency and IO perspective than the above forced
allocation pseudo-synchronous write behaviour.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-05-26 07:02:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On Thu, May 26, 2016 at 12:20:20PM +1000, Dave Chinner wrote:
> Stupid question: why is dpkg using fallocate() for such small ranges
> like that? I can't think of a more inefficient way to do small IO -
> using delayed allocation is far more optimal from a layout,
> overhead, latency and IO perspective than the above forced
> allocation pseudo-synchronous write behaviour.

Below is the commit adding it. Guillem, can you explain what fallocate
in the dpkg is supposed to help with? The way it's it does fallocate
before writes that aren't spare it looks actively harmful for any
Linux file system.

commit 87b0b20b86407baf1deb4e91b3fd839e01228ac8
Author: Guillem Jover <[email protected]>
Date: Tue Jul 15 21:00:52 2014 +0200

dpkg: Try to preallocate the disk size for extracted files

This might help in avoiding filesystem fragmentation, and possibly
improve performance on some filesystems.

We use the system specific methods if available, and possibly
improve performance on some filesystems.

We use the system specific methods if available, and only use
posix_fallocate() if nothing else is available, because on some systems
its semantics are counter to what we want to obtain here, as the libc
library will fallback to manually writing '\0' to each block to force
the preallocation, instead of just failing and leaving the application
to do so if desired.

2016-05-30 08:27:52

by Gernot Hillier

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

Hi!

On 25.05.2016 01:13, Theodore Ts'o wrote:
> On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
>> We experience strange delays with kernel 4.1.18 during dpkg
>> package installation on an ext4 filesystem after switching from
>> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
>> Installation of the same package takes 2s with ext3 and 31s with
>> ext4 on the same partition.
>>
>> Hardware is an Intel-based server with Supermicro X8DTH board and
>> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
>> 0x1000:0x0072, mpt2sas driver).
[...]
>> To me, the problem looks comparable to
>> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
>> a full hang and there's no RAID involved for us), so a closer look on
>> the SCSI layer or driver might be the next step?
>
> What I would suggest is to create a small test case which compares the
> time it takes to allocate 1 megabyte of memory, zero it, and then
> write one megabytes of zeros using the write(2) system call. Then try
> writing one megabytes of zero using the BLKZEROOUT ioctl.

Ok, this is my test code:

const int SIZE = 1*1024*1024;
char* buffer = malloc(SIZE);
uint64_t range[2] = { 0, SIZE };
int fd = open("/dev/sdb2", O_WRONLY);

bzero(buffer, SIZE);
write(fd, buffer, SIZE);
sync_file_range(fd, 0, 0, 2);

ioctl (fd, BLKZEROOUT, range);

close(fd);
free(buffer);

# strace -tt ./test-tytso
[...]
15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
15:46:27.546971 close(3) = 0

So the write() and sync_file_range() in the first case takes ~400 us
each while BLKZEROOUT takes... 60 ms. Wow.

> I'm pretty sure you'll see same issue with BLKZEROOUT ioctl, but at
> this point, we'll be able to send bug reports to the SCSI and block
> layer developers with something that makes this very clear that it has
> nothing to do with ext4.

Ok, I included linux-scsi and mpt2sas maintainers to the thread. Can you
please help to narrow this down further?

> This way we can also do some differential diagnosis; given that I'm
> not seeing this complaint from most people, I suspect it will be a
> matter of adding some specific devices to a blacklist (so that even
> though the SCSI device claims to support WRITE SAME, we need to
> disable it because it has a really lousy implementation of the SCSI
> WRITE SAME command).

Even a BLKZEROOUT for 512 bytes takes nearly 5 ms on this machine. This
really qualifies as "lousy implementation".

Any idea how I could find out whether disks or controller are to blame
here? Getting physical access to the machine and replacing disks might
take us some days, so any other suggestion is greatly appreciated!

--
With kind regards,

Gernot Hillier, Siemens AG
Corporate Technology, Corporate Competence Center Embedded Linux


2016-05-30 19:04:54

by Jun He

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

> On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
> >
> > To me, the problem looks comparable to
> > https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
> > a full hang and there's no RAID involved for us), so a closer look on
> > the SCSI layer or driver might be the next step?
>
> What I would suggest is to create a small test case which compares the
> time it takes to allocate 1 megabyte of memory, zero it, and then
> write one megabytes of zeros using the write(2) system call. Then try
> writing one megabytes of zero using the BLKZEROOUT ioctl.
>
> I'm pretty sure you'll see same issue with BLKZEROOUT ioctl, but at
> this point, we'll be able to send bug reports to the SCSI and block
> layer developers with something that makes this very clear that it has
> nothing to do with ext4.
>
> This way we can also do some differential diagnosis; given that I'm
> not seeing this complaint from most people, I suspect it will be a
> matter of adding some specific devices to a blacklist (so that even
> though the SCSI device claims to support WRITE SAME, we need to
> disable it because it has a really lousy implementation of the SCSI
> WRITE SAME command).
>
> Cheers,
>
> - Ted

I just want to add that I once experienced a similar problem: huge delay
with fallocate() and a SSD. It turns out block layer zeroouts extents by
discarding them (blkdev_issue_zeroout()), and the Intel SSD is VERY slow
at discarding.


2016-05-31 00:21:52

by Dave Chinner

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On Mon, May 30, 2016 at 10:27:52AM +0200, Gernot Hillier wrote:
> Hi!
>
> On 25.05.2016 01:13, Theodore Ts'o wrote:
> > On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
> >> We experience strange delays with kernel 4.1.18 during dpkg
> >> package installation on an ext4 filesystem after switching from
> >> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
> >> Installation of the same package takes 2s with ext3 and 31s with
> >> ext4 on the same partition.
> >>
> >> Hardware is an Intel-based server with Supermicro X8DTH board and
> >> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
> >> 0x1000:0x0072, mpt2sas driver).
> [...]
> >> To me, the problem looks comparable to
> >> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
> >> a full hang and there's no RAID involved for us), so a closer look on
> >> the SCSI layer or driver might be the next step?
> >
> > What I would suggest is to create a small test case which compares the
> > time it takes to allocate 1 megabyte of memory, zero it, and then
> > write one megabytes of zeros using the write(2) system call. Then try
> > writing one megabytes of zero using the BLKZEROOUT ioctl.
>
> Ok, this is my test code:
>
> const int SIZE = 1*1024*1024;
> char* buffer = malloc(SIZE);
> uint64_t range[2] = { 0, SIZE };
> int fd = open("/dev/sdb2", O_WRONLY);
>
> bzero(buffer, SIZE);
> write(fd, buffer, SIZE);
> sync_file_range(fd, 0, 0, 2);
>
> ioctl (fd, BLKZEROOUT, range);
>
> close(fd);
> free(buffer);
>
> # strace -tt ./test-tytso
> [...]
> 15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
> 15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
> 15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
> 15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
> 15:46:27.546971 close(3) = 0
>
> So the write() and sync_file_range() in the first case takes ~400 us
> each while BLKZEROOUT takes... 60 ms. Wow.

Comparing apples to oranges.

Unlike the name implies, sync_file_range() does not provide any data
integrity semantics what-so-ever: SYNC_FILE_RANGE_WRITE only submits
IO to clean dirty pages - that only takes 400us of CPU time. It
does not wait for completion, nor does it flush the drive cache and
so by the time the syscall returns to userspace the IO may not have
even been sent to the device (e.g. it could be queued by the IO
scheduler in the block layer). i.e. you're not timing IO, you're
timing CPU overhead of IO submission.

For an apples to apples comparison, you need to use fsync() to
physically force the written data to stable storage and wait for
completion. This is what BLKZEROOUT is effectively doing, so I think
you'll find fdatasync() also takes around 60ms...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-06-01 09:44:16

by Gernot Hillier

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

Hi!

On 31.05.2016 02:21, Dave Chinner wrote:
> On Mon, May 30, 2016 at 10:27:52AM +0200, Gernot Hillier wrote:
>> Hi!
>>
>> On 25.05.2016 01:13, Theodore Ts'o wrote:
>>> On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
>>>> We experience strange delays with kernel 4.1.18 during dpkg
>>>> package installation on an ext4 filesystem after switching from
>>>> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
>>>> Installation of the same package takes 2s with ext3 and 31s with
>>>> ext4 on the same partition.
>>>>
>>>> Hardware is an Intel-based server with Supermicro X8DTH board and
>>>> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
>>>> 0x1000:0x0072, mpt2sas driver).
>> [...]
>>>> To me, the problem looks comparable to
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
>>>> a full hang and there's no RAID involved for us), so a closer look on
>>>> the SCSI layer or driver might be the next step?
>>>
>>> What I would suggest is to create a small test case which compares the
>>> time it takes to allocate 1 megabyte of memory, zero it, and then
>>> write one megabytes of zeros using the write(2) system call. Then try
>>> writing one megabytes of zero using the BLKZEROOUT ioctl.
>>
>> Ok, this is my test code:
>>
>> const int SIZE = 1*1024*1024;
>> char* buffer = malloc(SIZE);
>> uint64_t range[2] = { 0, SIZE };
>> int fd = open("/dev/sdb2", O_WRONLY);
>>
>> bzero(buffer, SIZE);
>> write(fd, buffer, SIZE);
>> sync_file_range(fd, 0, 0, 2);
>>
>> ioctl (fd, BLKZEROOUT, range);
>>
>> close(fd);
>> free(buffer);
>>
>> # strace -tt ./test-tytso
>> [...]
>> 15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
>> 15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
>> 15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
>> 15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
>> 15:46:27.546971 close(3) = 0
>>
>> So the write() and sync_file_range() in the first case takes ~400 us
>> each while BLKZEROOUT takes... 60 ms. Wow.
>
> Comparing apples to oranges.
>
> Unlike the name implies, sync_file_range() does not provide any data
> integrity semantics what-so-ever: SYNC_FILE_RANGE_WRITE only submits
> IO to clean dirty pages - that only takes 400us of CPU time. It
[...]
> completion. This is what BLKZEROOUT is effectively doing, so I think
> you'll find fdatasync() also takes around 60ms...

Oops, sorry for that! I just copied the sync pattern which initially
caused the delays in dpkg.

With updated test, I still reproducably see a factor of ~2.5 between
write+fdatasync (16 ms) and BLKZEROOUT (40 ms), as an example:

17:12:30.742679 open("/dev/sdb2", O_WRONLY) = 3
17:12:30.742733 fdatasync(3) = 0
17:12:30.743148 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
17:12:30.743605 fdatasync(3) = 0
17:12:30.758984 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
17:12:30.798937 close(3) = 0

So, I'm a bit confused now. Does this mean we see three bugs here?

1) inefficient use of fallocate() + sync_file_range() by dpkg
(was also reported as [1])
2) BLKZEROOUT more then 2 times slower than write+fsync
3) again unexpected ext4 delay caused by fallocate()+sync_file_range()

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824636

--
With kind regards,
Gernot Hillier

Siemens AG, Corporate Technology
Corporate Competence Center Embedded Linux


2016-06-01 13:17:21

by Gernot Hillier

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On 01.06.2016 11:44, Gernot Hillier wrote:
> Hi!
>
> On 31.05.2016 02:21, Dave Chinner wrote:
>> On Mon, May 30, 2016 at 10:27:52AM +0200, Gernot Hillier wrote:
>>> Hi!
>>>
>>> On 25.05.2016 01:13, Theodore Ts'o wrote:
>>>> On Tue, May 24, 2016 at 07:07:41PM +0200, Gernot Hillier wrote:
>>>>> We experience strange delays with kernel 4.1.18 during dpkg
>>>>> package installation on an ext4 filesystem after switching from
>>>>> Ubuntu 14.04 to 16.04. We can reproduce the issue with kernel 4.6.
>>>>> Installation of the same package takes 2s with ext3 and 31s with
>>>>> ext4 on the same partition.
>>>>>
>>>>> Hardware is an Intel-based server with Supermicro X8DTH board and
>>>>> Seagate ST973451SS disks connected to an LSI SAS2008 controller (PCI
>>>>> 0x1000:0x0072, mpt2sas driver).
>>> [...]
>>>>> To me, the problem looks comparable to
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=56821 (even if we don't see
>>>>> a full hang and there's no RAID involved for us), so a closer look on
>>>>> the SCSI layer or driver might be the next step?
>>>>
>>>> What I would suggest is to create a small test case which compares the
>>>> time it takes to allocate 1 megabyte of memory, zero it, and then
>>>> write one megabytes of zeros using the write(2) system call. Then try
>>>> writing one megabytes of zero using the BLKZEROOUT ioctl.
>>>
>>> Ok, this is my test code:
>>>
>>> const int SIZE = 1*1024*1024;
>>> char* buffer = malloc(SIZE);
>>> uint64_t range[2] = { 0, SIZE };
>>> int fd = open("/dev/sdb2", O_WRONLY);
>>>
>>> bzero(buffer, SIZE);
>>> write(fd, buffer, SIZE);
>>> sync_file_range(fd, 0, 0, 2);
>>>
>>> ioctl (fd, BLKZEROOUT, range);
>>>
>>> close(fd);
>>> free(buffer);
>>>
>>> # strace -tt ./test-tytso
>>> [...]
>>> 15:46:27.481636 open("/dev/sdb2", O_WRONLY) = 3
>>> 15:46:27.482004 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
>>> 15:46:27.482438 sync_file_range(3, 0, 0, SYNC_FILE_RANGE_WRITE) = 0
>>> 15:46:27.482698 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
>>> 15:46:27.546971 close(3) = 0
>>>
>>> So the write() and sync_file_range() in the first case takes ~400 us
>>> each while BLKZEROOUT takes... 60 ms. Wow.
>>
>> Comparing apples to oranges.
>>
>> Unlike the name implies, sync_file_range() does not provide any data
>> integrity semantics what-so-ever: SYNC_FILE_RANGE_WRITE only submits
>> IO to clean dirty pages - that only takes 400us of CPU time. It
> [...]
>> completion. This is what BLKZEROOUT is effectively doing, so I think
>> you'll find fdatasync() also takes around 60ms...
>
> Oops, sorry for that! I just copied the sync pattern which initially
> caused the delays in dpkg.
>
> With updated test, I still reproducably see a factor of ~2.5 between
> write+fdatasync (16 ms) and BLKZEROOUT (40 ms), as an example:
>
> 17:12:30.742679 open("/dev/sdb2", O_WRONLY) = 3
> 17:12:30.742733 fdatasync(3) = 0
> 17:12:30.743148 write(3, "\0\0\0\0\0\0"..., 1048576) = 1048576
> 17:12:30.743605 fdatasync(3) = 0
> 17:12:30.758984 ioctl(3, BLKZEROOUT, [0, 100000]) = 0
> 17:12:30.798937 close(3) = 0
>
> So, I'm a bit confused now. Does this mean we see three bugs here?
>
> 1) inefficient use of fallocate() + sync_file_range() by dpkg
> (was also reported as [1])
> 2) BLKZEROOUT more then 2 times slower than write+fsync
> 3) again unexpected ext4 delay caused by fallocate()+sync_file_range()
>
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824636

BTW, that bug reminded me that we initially saw this issue also on a
second machine, sorry for not mentioning/checking this earlier:

- Intel server with Supermicro X9DR3-F board, Intel C606 SAS controller
(PCI id 8086:1d68) with Seagate ST300MM0006 disks.

I repeated the discussed tests and found comparable results on this machine:

- 3 seconds dpkg install time on ext3 vs. 80 seconds for ext4
on same partition for same package
- 40 ms for fallocate+write+sync_file_range for writing a few bytes
- 15 ms for write+fdatasync vs. 45 ms for BLKZEROOUT on raw device

So this seems to be not bound to one specific disk+controller setup, but
it still can't be a common problem affecting many people as then we
would see more reports about it...

--
Gernot

2016-06-01 14:12:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On Wed, Jun 01, 2016 at 03:17:14PM +0200, Gernot Hillier wrote:
> I repeated the discussed tests and found comparable results on this machine:
>
> - 3 seconds dpkg install time on ext3 vs. 80 seconds for ext4
> on same partition for same package
> - 40 ms for fallocate+write+sync_file_range for writing a few bytes
> - 15 ms for write+fdatasync vs. 45 ms for BLKZEROOUT on raw device
>
> So this seems to be not bound to one specific disk+controller setup, but
> it still can't be a common problem affecting many people as then we
> would see more reports about it...

OK, so let's try to get common baseline, shall we? I'm using as my
test package:

http://debug.mirrors.debian.org/debian-debug/pool/main/e/e2fsprogs/e2fsprogs-dbgsym_1.43-2_amd64.deb

% ls -s /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
1108 /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
% md5sum /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
148ee631f46ed2e05a17faf36cc9d630 /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb

# dpkg --purge e2fsprogs-dbgsym
(Reading database ... 322967 files and directories currently installed.)
Removing e2fsprogs-dbgsym (1.43-3) ...

# /usr/bin/time dpkg --force-depends -i /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
Selecting previously unselected package e2fsprogs-dbgsym.
(Reading database ... 322947 files and directories currently installed.)
Preparing to unpack .../e2fsprogs-dbgsym_1.43-3_amd64.deb ...
Unpacking e2fsprogs-dbgsym (1.43-3) ...
Setting up e2fsprogs-dbgsym (1.43-3) ...
0.21user 0.19system 0:00.54elapsed 74%CPU (0avgtext+0avgdata 73188maxresident)k
0inputs+10168outputs (0major+24935minor)pagefaults 0swaps

Something else would be useful is to make sure /tmp is a tmpfs mounted
file system (so you don't get any I/O's caused by /tmp), and then do:

# dpkg --purge e2fsprogs-dbgsym
# btrace /dev/sda3 > /tmp/btrace.out &
# /usr/bin/time dpkg --force-depends -i /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
# kill %btrace

Replace /dev/sda3 with the device name of your root partition, and
then send us the compressed btrace.out file, which might give us a
clue about what is going on.

Cheers,

- Ted

2016-06-02 16:23:48

by Gernot Hillier

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

On 01.06.2016 16:12, Theodore Ts'o wrote:
> On Wed, Jun 01, 2016 at 03:17:14PM +0200, Gernot Hillier wrote:
>> I repeated the discussed tests and found comparable results on this machine:
>>
>> - 3 seconds dpkg install time on ext3 vs. 80 seconds for ext4
>> on same partition for same package
>> - 40 ms for fallocate+write+sync_file_range for writing a few bytes
>> - 15 ms for write+fdatasync vs. 45 ms for BLKZEROOUT on raw device
>>
>> So this seems to be not bound to one specific disk+controller setup, but
>> it still can't be a common problem affecting many people as then we
>> would see more reports about it...
>
> OK, so let's try to get common baseline, shall we?

Sure!

> I'm using as my test package:
>
> http://debug.mirrors.debian.org/debian-debug/pool/main/e/e2fsprogs/e2fsprogs-dbgsym_1.43-3_amd64.deb
[...]
> # /usr/bin/time dpkg --force-depends -i /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
[...]
> 0.21user 0.19system 0:00.54elapsed 74%CPU (0avgtext+0avgdata 73188maxresident)k
> 0inputs+10168outputs (0major+24935minor)pagefaults 0swaps

On machine with Intel SCSI controller and on ext4 root, I see:

0.15user 0.00system 0:00.92elapsed 16%CPU (0avgtext+0avgdata
14144maxresident)k
0inputs+0outputs (0major+5447minor)pagefaults 0swaps

Installing to some ext3 partition:

0.14user 0.01system 0:00.56elapsed 26%CPU (0avgtext+0avgdata
14092maxresident)k
0inputs+0outputs (0major+5431minor)pagefaults 0swaps

This doesn't really advertise the issue as it only installs a small
number of files - yet, I still see a reproducible factor of 1.5-2 for
total elapsed time between ext3 and ext4...

> Something else would be useful is to make sure /tmp is a tmpfs mounted
> file system (so you don't get any I/O's caused by /tmp), and then do:

Ok.

> # dpkg --purge e2fsprogs-dbgsym
> # btrace /dev/sda3 > /tmp/btrace.out &
> # /usr/bin/time dpkg --force-depends -i /tmp/e2fsprogs-dbgsym_1.43-3_amd64.deb
> # kill %btrace

Please find btrace.out.bz2 attached. HTH...

Please note that I'm offline for the next couple of days, so there will
be some delay in my answers!

--
Gernot Hillier

Siemens AG, Corporate Technology
Corporate Competence Center Embedded Linux


Attachments:
btrace.out.bz2 (13.87 kB)

2016-07-13 13:57:18

by Gernot Hillier

[permalink] [raw]
Subject: Re: unexpected sync delays in dpkg for small pre-allocated files on ext4

Dear Ted,

On 01.06.2016 16:12, Theodore Ts'o wrote:
> On Wed, Jun 01, 2016 at 03:17:14PM +0200, Gernot Hillier wrote:
>> I repeated the discussed tests and found comparable results on this machine:
>>
>> - 3 seconds dpkg install time on ext3 vs. 80 seconds for ext4
>> on same partition for same package
>> - 40 ms for fallocate+write+sync_file_range for writing a few bytes
>> - 15 ms for write+fdatasync vs. 45 ms for BLKZEROOUT on raw device
>>
>> So this seems to be not bound to one specific disk+controller setup, but
>> it still can't be a common problem affecting many people as then we
>> would see more reports about it...

Sorry for 2nd follow-up and the long delay (caused by vacation +
illness), but we went over the whole discussion again and can now likely
confirm your assumption of a bad SCSI WRITE SAME implementation on our
hardware.

We found that by issuing "echo 0 >
/sys/devices/.../scsi_disk/0:0:0:0/max_write_same_blocks" we can also
fix our issue, reducing dpkg installation time for e.g. linux-headers
from minutes to seconds.

Did you already have time to look into my old btrace from June, 2nd, did
it help somehow?

If not, please find updated btraces from two machines, for slow and fast
case, on each one below.

Does this provide enough details to blacklist our devices for SCSI WRITE
SAME?

> OK, so let's try to get common baseline, shall we? I'm using as my
> test package:
>
> http://debug.mirrors.debian.org/debian-debug/pool/main/e/e2fsprogs/e2fsprogs-dbgsym_1.43-2_amd64.deb

This time, I decided to use a test package with a bit more files to make
the issue more obvious:

http://snapshot.debian.org/archive/debian/20160712T042309Z/pool/main/m/manpages/manpages_4.06-1_all.deb

# md5sum manpages_4.06-1_all.deb
251aadf9a0117cac8248343a9f09d74b manpages_4.06-1_all.deb

On machine 1, a Supermicro X8DTH board with Xeon E5520 CPU, equipped
with a SEAGATE ST973451SS disk connected to an LSI SAS2008 controller
(mpt3sas.ko):

(If you do this at home, please note that this will overwrite your local
manpages, so you need to go back to original package afterwards!)

# /usr/bin/time dpkg --no-triggers --force-depends --unpack
manpages_4.06-1_all.deb
[...]
0.13user 0.07system 0:03.38elapsed 6%CPU (0avgtext+0avgdata
15132maxresident)k
0inputs+4040outputs (0major+6066minor)pagefaults 0swaps

# echo 0 >
/sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/max_write_same_blocks

# /usr/bin/time dpkg --no-triggers --force-depends --unpack
manpages_4.06-1_all.deb
[...]
0.11user 0.08system 0:00.55elapsed 35%CPU (0avgtext+0avgdata
15168maxresident)k
0inputs+4056outputs (0major+6219minor)pagefaults 0swaps

So disabling SCSI WRITE SAME reduced installation time for a package
with ~300 files from 3.5 s to 1 s.

On machine 2, a Supermicro X9DR3-F board with Xeon E5-2620, SEAGATE
ST300MM0006 disk connected to an Intel C606 SAS controller, default
installation time for manpages_4.06-1_all.deb was 9s, dropping to 1s
after "echo 0 > .../max_write_same_blocks".

For "real" packages like linux-headers, disabling SCSI WRITE SAME
reduced installation time from minutes to seconds.

(For sure, I repeated those measurements several times with switching
back and forth between both cases, rebooting etc.)

I followed your instructions, assured that btrace output ends up on
tmpfs and recorded four new btraces. As installation of manpages.deb
results in a 350k btrace (40k compressed), I decided to provide URLs
instead of attachments this time:

https://github.com/gernot-h/slow-dpkg/blob/master/btrace-machine1-write_same.out.bz2?raw=true
-> btrace for default (slow) case on machine1 (i.e. after reboot or
"echo 65535 > .../max_write_same_blocks"

https://github.com/gernot-h/slow-dpkg/blob/master/btrace-machine1-no_write_same.out.bz2?raw=true
-> btrace for fast case on machine1 (i.e. after "echo 0 >
.../max_write_same_blocks"

https://github.com/gernot-h/slow-dpkg/blob/master/btrace-machine2-write_same.out.bz2?raw=true
-> btrace for default (slow) case on machine2 (i.e. after reboot or
"echo 65535 > .../max_write_same_blocks"

https://github.com/gernot-h/slow-dpkg/blob/master/btrace-machine2-no_write_same.out.bz2?raw=true
-> btrace for fast case on machine1 (i.e. after "echo 0 >
.../max_write_same_blocks"

Thanks again and please let me know if you need any additional traces,
test results, measurements, bug report, etc.!

--
With kind regards,
Gernot Hillier

Siemens AG, Corporate Technology
Corporate Competence Center Embedded Linux