On 2019/11/07 16:04, Andrea Vai wrote:
> Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal ha scritto:
>>
>>
>> Please simply try your write tests after doing this:
>>
>> echo mq-deadline > /sys/block/<name of your USB
>> disk>/queue/scheduler
>>
>> And confirm that mq-deadline is selected with:
>>
>> cat /sys/block/<name of your USB disk>/queue/scheduler
>> [mq-deadline] kyber bfq none
>
> ok, which kernel should I test with this: the fresh git cloned, or the
> one just patched with Alan's patch, or doesn't matter which one?
Probably all of them to see if there are any differences.
>
> Thanks, and bye,
> Andrea
>
>
--
Damien Le Moal
Western Digital Research
[Sorry for the duplicate message, it didn't reach the lists due to
html formatting]
On Thu, Nov 7, 2019 at 08:54, Damien Le Moal
<[email protected]> wrote:
>
> On 2019/11/07 16:04, Andrea Vai wrote:
> > Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal ha scritto:
> >>
> >>
> >> Please simply try your write tests after doing this:
> >>
> >> echo mq-deadline > /sys/block/<name of your USB
> >> disk>/queue/scheduler
> >>
> >> And confirm that mq-deadline is selected with:
> >>
> >> cat /sys/block/<name of your USB disk>/queue/scheduler
> >> [mq-deadline] kyber bfq none
> >
> > ok, which kernel should I test with this: the fresh git cloned, or the
> > one just patched with Alan's patch, or doesn't matter which one?
>
> Probably all of them to see if there are any differences.
with both kernels, the output of
cat /sys/block/sdh/queue/scheduler
already contains [mq-deadline]: is it correct to assume that the echo
command and the subsequent testing are useless? What should I do now?
Thanks, and bye
Andrea
On Thu, Nov 07, 2019 at 07:59:44PM +0100, Andrea Vai wrote:
> [Sorry for the duplicate message, it didn't reach the lists due to
> html formatting]
> Il giorno gio 7 nov 2019 alle ore 08:54 Damien Le Moal
> <[email protected]> ha scritto:
> >
> > On 2019/11/07 16:04, Andrea Vai wrote:
> > > Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal ha scritto:
> > >>
> > >>
> > >> Please simply try your write tests after doing this:
> > >>
> > >> echo mq-deadline > /sys/block/<name of your USB
> > >> disk>/queue/scheduler
> > >>
> > >> And confirm that mq-deadline is selected with:
> > >>
> > >> cat /sys/block/<name of your USB disk>/queue/scheduler
> > >> [mq-deadline] kyber bfq none
> > >
> > > ok, which kernel should I test with this: the fresh git cloned, or the
> > > one just patched with Alan's patch, or doesn't matter which one?
> >
> > Probably all of them to see if there are any differences.
>
> with both kernels, the output of
> cat /sys/block/sdh/queue/schedule
>
> already contains [mq-deadline]: is it correct to assume that the echo
> command and the subsequent testing is useless? What to do now?
Another thing we could try is to use 'none' via the following command:
echo none > /sys/block/sdh/queue/scheduler   # assuming 'sdh' points to the usb storage disk
Because the USB storage HBA has a single hardware queue with a depth of 1,
this should make IO dispatch follow the order of bio submission.
Andrea, could you switch the io scheduler to none and let us know whether
it makes a difference?
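For example, something along these lines should do it (a minimal sketch;
'sdh' is just a placeholder for your USB disk name):

#!/bin/sh
# Switch the disk to the 'none' scheduler and confirm it took effect
# before re-running the copy test; DISK=sdh is a placeholder.
DISK=sdh
echo none > /sys/block/$DISK/queue/scheduler
cat /sys/block/$DISK/queue/scheduler   # the active scheduler is shown in brackets, e.g. [none]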
Thanks,
Ming
On Sun, 10/11/2019 at 06:28 +0800, Ming Lei wrote:
> On Thu, Nov 07, 2019 at 07:59:44PM +0100, Andrea Vai wrote:
> > [Sorry for the duplicate message, it didn't reach the lists due to
> > html formatting]
> > Il giorno gio 7 nov 2019 alle ore 08:54 Damien Le Moal
> > <[email protected]> ha scritto:
> > >
> > > On 2019/11/07 16:04, Andrea Vai wrote:
> > > > Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal ha
> scritto:
> > > >>
> > > >>
> > > >> Please simply try your write tests after doing this:
> > > >>
> > > >> echo mq-deadline > /sys/block/<name of your USB
> > > >> disk>/queue/scheduler
> > > >>
> > > >> And confirm that mq-deadline is selected with:
> > > >>
> > > >> cat /sys/block/<name of your USB disk>/queue/scheduler
> > > >> [mq-deadline] kyber bfq none
> > > >
> > > > ok, which kernel should I test with this: the fresh git
> cloned, or the
> > > > one just patched with Alan's patch, or doesn't matter which
> one?
> > >
> > > Probably all of them to see if there are any differences.
> >
> > with both kernels, the output of
> > cat /sys/block/sdh/queue/schedule
> >
> > already contains [mq-deadline]: is it correct to assume that the
> echo
> > command and the subsequent testing is useless? What to do now?
>
> Another thing we could try is to use 'none' via the following
> command:
>
> echo none > /sys/block/sdh/queue/scheduler #suppose 'sdh' points
> to the usb storage disk
>
> Because USB storage HBA is single hw queue, which depth is 1. This
> way
> should change to dispatch IO in the order of bio submission.
>
> Andrea, could you switch io scheduler to none and update us if
> difference
> can be made?
Of course I would do it, but I see that with the "good" kernel the
output of "cat /sys/block/sdf/queue/scheduler" (yes, now it's sdf) is
noop deadline [cfq]
i.e. it doesn't show "none". Does it matter? (sorry if it's a silly
question)
Thanks, and bye
Andrea
On Mon, Nov 11, 2019 at 11:50:49AM +0100, Andrea Vai wrote:
> Il giorno dom, 10/11/2019 alle 06.28 +0800, Ming Lei ha scritto:
> > On Thu, Nov 07, 2019 at 07:59:44PM +0100, Andrea Vai wrote:
> > > [Sorry for the duplicate message, it didn't reach the lists due to
> > > html formatting]
> > > Il giorno gio 7 nov 2019 alle ore 08:54 Damien Le Moal
> > > <[email protected]> ha scritto:
> > > >
> > > > On 2019/11/07 16:04, Andrea Vai wrote:
> > > > > Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal ha
> > scritto:
> > > > >>
> > > > >>
> > > > >> Please simply try your write tests after doing this:
> > > > >>
> > > > >> echo mq-deadline > /sys/block/<name of your USB
> > > > >> disk>/queue/scheduler
> > > > >>
> > > > >> And confirm that mq-deadline is selected with:
> > > > >>
> > > > >> cat /sys/block/<name of your USB disk>/queue/scheduler
> > > > >> [mq-deadline] kyber bfq none
> > > > >
> > > > > ok, which kernel should I test with this: the fresh git
> > cloned, or the
> > > > > one just patched with Alan's patch, or doesn't matter which
> > one?
> > > >
> > > > Probably all of them to see if there are any differences.
> > >
> > > with both kernels, the output of
> > > cat /sys/block/sdh/queue/schedule
> > >
> > > already contains [mq-deadline]: is it correct to assume that the
> > echo
> > > command and the subsequent testing is useless? What to do now?
> >
> > Another thing we could try is to use 'none' via the following
> > command:
> >
> > echo none > /sys/block/sdh/queue/scheduler #suppose 'sdh' points
> > to the usb storage disk
> >
> > Because USB storage HBA is single hw queue, which depth is 1. This
> > way
> > should change to dispatch IO in the order of bio submission.
> >
> > Andrea, could you switch io scheduler to none and update us if
> > difference
> > can be made?
>
> Of course I would to it, but I see that with the "good" kernel the
> output of "cat /sys/block/sdf/queue/scheduler" (yes, now it's sdf) is
>
> noop deadline [cfq]
I am not sure whether cfq makes a difference, and I guess you may get the
same result with noop or deadline. However, if you only see good write
performance with cfq, you may try 'bfq' and see if it behaves like cfq.
>
> , i.e. it doesn't show "none". Does it matter? (sorry if it's a silly
> question)
We are talking about the new kernel, in which 'noop deadline [cfq]' cannot
appear any more. You should see the following output from
'/sys/block/sdf/queue/scheduler' in the new kernel:
[mq-deadline] kyber bfq none
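If it helps, a minimal sketch for comparing the schedulers on the new
kernel could look like the following (the disk name, mount point and test
file are placeholders, and kyber/bfq only appear if they are built into
your kernel):

#!/bin/sh
# Time the same buffered copy under each scheduler offered by the new kernel.
DISK=sdf                                  # placeholder: your USB disk
for sched in mq-deadline kyber bfq none; do
    echo $sched > /sys/block/$DISK/queue/scheduler
    mount /dev/${DISK}1 /mnt/pendrive
    start=$(date +%s)
    cp testfile-1G /mnt/pendrive
    umount /mnt/pendrive                  # umount flushes the cached data to the stick
    echo "$sched: $(( $(date +%s) - start ))s"
done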
thanks,
Ming
On Mon, 11/11/2019 at 19:05 +0800, Ming Lei wrote:
> On Mon, Nov 11, 2019 at 11:50:49AM +0100, Andrea Vai wrote:
> > Il giorno dom, 10/11/2019 alle 06.28 +0800, Ming Lei ha scritto:
> > > On Thu, Nov 07, 2019 at 07:59:44PM +0100, Andrea Vai wrote:
> > > > [Sorry for the duplicate message, it didn't reach the lists
> due to
> > > > html formatting]
> > > > Il giorno gio 7 nov 2019 alle ore 08:54 Damien Le Moal
> > > > <[email protected]> ha scritto:
> > > > >
> > > > > On 2019/11/07 16:04, Andrea Vai wrote:
> > > > > > Il giorno mer, 06/11/2019 alle 22.13 +0000, Damien Le Moal
> ha
> > > scritto:
> > > > > >>
> > > > > >>
> > > > > >> Please simply try your write tests after doing this:
> > > > > >>
> > > > > >> echo mq-deadline > /sys/block/<name of your USB
> > > > > >> disk>/queue/scheduler
> > > > > >>
> > > > > >> And confirm that mq-deadline is selected with:
> > > > > >>
> > > > > >> cat /sys/block/<name of your USB disk>/queue/scheduler
> > > > > >> [mq-deadline] kyber bfq none
> > > > > >
> > > > > > ok, which kernel should I test with this: the fresh git
> > > cloned, or the
> > > > > > one just patched with Alan's patch, or doesn't matter
> which
> > > one?
> > > > >
> > > > > Probably all of them to see if there are any differences.
> > > >
> > > > with both kernels, the output of
> > > > cat /sys/block/sdh/queue/schedule
> > > >
> > > > already contains [mq-deadline]: is it correct to assume that
> the
> > > echo
> > > > command and the subsequent testing is useless? What to do now?
> > >
> > > Another thing we could try is to use 'none' via the following
> > > command:
> > >
> > > echo none > /sys/block/sdh/queue/scheduler #suppose 'sdh'
> points
> > > to the usb storage disk
> > >
> > > Because USB storage HBA is single hw queue, which depth is 1.
> This
> > > way
> > > should change to dispatch IO in the order of bio submission.
> > >
> > > Andrea, could you switch io scheduler to none and update us if
> > > difference
> > > can be made?
> >
> > Of course I would to it, but I see that with the "good" kernel the
> > output of "cat /sys/block/sdf/queue/scheduler" (yes, now it's sdf)
> is
> >
> > noop deadline [cfq]
>
> Not sure if cfq makes a difference, and I guess you may get same
> result
> with noop or deadline. However, if you only see good write
> performance with
> cfq, you may try 'bfq' and see if it works as cfq.
>
> >
> > , i.e. it doesn't show "none". Does it matter? (sorry if it's a
> silly
> > question)
>
> We are talking about new kernel in which there can't be 'noop
> deadline [cfq]'
> any more. And you should see the following output from
> '/sys/block/sdf/queue/scheduler'
> in the new kernel:
>
> [mq-deadline] kyber bfq none
>
>
ok, sorry, I misunderstood: I assumed you wanted me to compare the "none"
setting in the old kernel with the same setting in the new kernel. Now
it's clear to me that you want me to compare the different scheduler
settings in the new kernel.
Thanks, and bye
Andrea
On Sun, 10/11/2019 at 06:28 +0800, Ming Lei wrote:
> Another thing we could try is to use 'none' via the following
> command:
>
> echo none > /sys/block/sdh/queue/scheduler #suppose 'sdh' points
> to the usb storage disk
>
> Because USB storage HBA is single hw queue, which depth is 1. This
> way
> should change to dispatch IO in the order of bio submission.
>
> Andrea, could you switch io scheduler to none and update us if
> difference
> can be made?
Using the new kernel, there is indeed a difference because the time to
copy a file is 1800 seconds with [mq-deadline], and 340 seconds with
[none]. But that is still far away from the old kernel, which performs
the copy of the same file in 76 seconds.
Side notes:
- The numbers above are average values calculated on 100 trials for
each different situation. As previously noticed on this thread, with
the new kernel the times are also very different among the different
trials in the same situation. With the old kernel the standard
deviation on the times in a set of 100 trials is much smaller (to give
some mean/sigma values: m=1800->s=530; m=340->s=131; m=76->s=13; ).
- The size of the transferred file was 1GB in these trials. Smaller
files don't always show an appreciable difference, but if you want I can
also provide that data. Of course, I can also provide the raw data of
each set of trials.
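(Something like the following awk one-liner can reproduce the mean/sigma
figures from the raw per-trial times; times.txt here is just a placeholder
for a file with one elapsed time in seconds per line:

awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n; printf "n=%d mean=%.1f sigma=%.1f\n", n, m, sqrt(ss / n - m * m) }' times.txt
)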
Thanks,
and bye,
Andrea
On Fri, Nov 22, 2019 at 08:16:30PM +0100, Andrea Vai wrote:
> Il giorno dom, 10/11/2019 alle 06.28 +0800, Ming Lei ha scritto:
> > Another thing we could try is to use 'none' via the following
> > command:
> >
> > echo none > /sys/block/sdh/queue/scheduler #suppose 'sdh' points
> > to the usb storage disk
> >
> > Because USB storage HBA is single hw queue, which depth is 1. This
> > way
> > should change to dispatch IO in the order of bio submission.
> >
> > Andrea, could you switch io scheduler to none and update us if
> > difference
> > can be made?
>
> Using the new kernel, there is indeed a difference because the time to
> copy a file is 1800 seconds with [mq-deadline], and 340 seconds with
> [none]. But that is still far away from the old kernel, which performs
> the copy of the same file in 76 seconds.
Please post the log of 'lsusb -v', and I will try to make a patch to
address the issue.
thanks,
Ming
On Sat, 23/11/2019 at 15:28 +0800, Ming Lei wrote:
>
> Please post the log of 'lsusb -v', and I will try to make a patch
> for
> addressing the issue.
attached,
Thanks, and bye
Andrea
On Sat, Nov 23, 2019 at 04:44:55PM +0100, Andrea Vai wrote:
> Il giorno sab, 23/11/2019 alle 15.28 +0800, Ming Lei ha scritto:
> >
> > Please post the log of 'lsusb -v', and I will try to make a patch
> > for
> > addressing the issue.
>
> attached,
Please apply the attached patch, then re-build, install, and reboot the kernel.
This time, please don't switch the io scheduler.
Thanks,
Ming
On Mon, 25/11/2019 at 11:54 +0800, Ming Lei wrote:
> On Sat, Nov 23, 2019 at 04:44:55PM +0100, Andrea Vai wrote:
> > Il giorno sab, 23/11/2019 alle 15.28 +0800, Ming Lei ha scritto:
> > >
> > > Please post the log of 'lsusb -v', and I will try to make a
> patch
> > > for
> > > addressing the issue.
> >
> > attached,
>
> Please apply the attached patch, and re-build & install & reboot
> kernel.
>
> This time, please don't switch io scheduler.
# patch -p1 < usb.patch outputs:
(Stripping trailing CRs from patch; use --binary to disable.)
patching file block/blk-mq.c
Hunk #1 succeeded at 1465 (offset 29 lines).
Hunk #2 succeeded at 3061 (offset 13 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file drivers/scsi/scsi_lib.c
Hunk #1 succeeded at 1902 (offset -37 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file drivers/usb/storage/scsiglue.c
Hunk #1 succeeded at 651 (offset -10 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file include/linux/blk-mq.h
Hunk #1 succeeded at 226 (offset -162 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file include/scsi/scsi_host.h
patch unexpectedly ends in middle of line
patch unexpectedly ends in middle of line
Just to be sure before I go on: is this correct? It sounds like an error,
but I don't know whether it is important.
Thanks,
Andrea
On Mon, Nov 25, 2019 at 11:11:00AM +0100, Andrea Vai wrote:
> Il giorno lun, 25/11/2019 alle 11.54 +0800, Ming Lei ha scritto:
> > On Sat, Nov 23, 2019 at 04:44:55PM +0100, Andrea Vai wrote:
> > > Il giorno sab, 23/11/2019 alle 15.28 +0800, Ming Lei ha scritto:
> > > >
> > > > Please post the log of 'lsusb -v', and I will try to make a
> > patch
> > > > for
> > > > addressing the issue.
> > >
> > > attached,
> >
> > Please apply the attached patch, and re-build & install & reboot
> > kernel.
> >
> > This time, please don't switch io scheduler.
>
> # patch -p1 < usb.patch outputs:
>
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file block/blk-mq.c
> Hunk #1 succeeded at 1465 (offset 29 lines).
> Hunk #2 succeeded at 3061 (offset 13 lines).
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file drivers/scsi/scsi_lib.c
> Hunk #1 succeeded at 1902 (offset -37 lines).
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file drivers/usb/storage/scsiglue.c
> Hunk #1 succeeded at 651 (offset -10 lines).
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file include/linux/blk-mq.h
> Hunk #1 succeeded at 226 (offset -162 lines).
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file include/scsi/scsi_host.h
> patch unexpectedly ends in middle of line
> patch unexpectedly ends in middle of line
>
> Just to be sure I have to go on, is this correct? Sounds like an error
> but I don't know if it is important.
It looks like there is a small conflict, but patch resolved it, so the
result is correct; please go on with your test.
Thanks,
Ming
On Mon, 25/11/2019 at 18:29 +0800, Ming Lei wrote:
> On Mon, Nov 25, 2019 at 11:11:00AM +0100, Andrea Vai wrote:
> > Il giorno lun, 25/11/2019 alle 11.54 +0800, Ming Lei ha scritto:
> > > On Sat, Nov 23, 2019 at 04:44:55PM +0100, Andrea Vai wrote:
> > > > Il giorno sab, 23/11/2019 alle 15.28 +0800, Ming Lei ha
> scritto:
> > > > >
> > > > > Please post the log of 'lsusb -v', and I will try to make a
> > > patch
> > > > > for
> > > > > addressing the issue.
> > > >
> > > > attached,
> > >
> > > Please apply the attached patch, and re-build & install & reboot
> > > kernel.
> > >
> > > This time, please don't switch io scheduler.
> >
> > # patch -p1 < usb.patch outputs:
> >
> > (Stripping trailing CRs from patch; use --binary to disable.)
> > patching file block/blk-mq.c
> > Hunk #1 succeeded at 1465 (offset 29 lines).
> > Hunk #2 succeeded at 3061 (offset 13 lines).
> > (Stripping trailing CRs from patch; use --binary to disable.)
> > patching file drivers/scsi/scsi_lib.c
> > Hunk #1 succeeded at 1902 (offset -37 lines).
> > (Stripping trailing CRs from patch; use --binary to disable.)
> > patching file drivers/usb/storage/scsiglue.c
> > Hunk #1 succeeded at 651 (offset -10 lines).
> > (Stripping trailing CRs from patch; use --binary to disable.)
> > patching file include/linux/blk-mq.h
> > Hunk #1 succeeded at 226 (offset -162 lines).
> > (Stripping trailing CRs from patch; use --binary to disable.)
> > patching file include/scsi/scsi_host.h
> > patch unexpectedly ends in middle of line
> > patch unexpectedly ends in middle of line
> >
> > Just to be sure I have to go on, is this correct? Sounds like an
> error
> > but I don't know if it is important.
>
> Looks there is small conflict, however it has been fixed by patch,
> so
> it is correct, please go on your test.
Done, it still fails (2000 seconds or more to copy 1GB) :-(
cat /sys/block/sdf/queue/scheduler outputs:
[mq-deadline] none
What to try next?
Thanks,
Andrea
On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> Il giorno lun, 25/11/2019 alle 18.29 +0800, Ming Lei ha scritto:
> > On Mon, Nov 25, 2019 at 11:11:00AM +0100, Andrea Vai wrote:
> > > Il giorno lun, 25/11/2019 alle 11.54 +0800, Ming Lei ha scritto:
> > > > On Sat, Nov 23, 2019 at 04:44:55PM +0100, Andrea Vai wrote:
> > > > > Il giorno sab, 23/11/2019 alle 15.28 +0800, Ming Lei ha
> > scritto:
> > > > > >
> > > > > > Please post the log of 'lsusb -v', and I will try to make a
> > > > patch
> > > > > > for
> > > > > > addressing the issue.
> > > > >
> > > > > attached,
> > > >
> > > > Please apply the attached patch, and re-build & install & reboot
> > > > kernel.
> > > >
> > > > This time, please don't switch io scheduler.
> > >
> > > # patch -p1 < usb.patch outputs:
> > >
> > > (Stripping trailing CRs from patch; use --binary to disable.)
> > > patching file block/blk-mq.c
> > > Hunk #1 succeeded at 1465 (offset 29 lines).
> > > Hunk #2 succeeded at 3061 (offset 13 lines).
> > > (Stripping trailing CRs from patch; use --binary to disable.)
> > > patching file drivers/scsi/scsi_lib.c
> > > Hunk #1 succeeded at 1902 (offset -37 lines).
> > > (Stripping trailing CRs from patch; use --binary to disable.)
> > > patching file drivers/usb/storage/scsiglue.c
> > > Hunk #1 succeeded at 651 (offset -10 lines).
> > > (Stripping trailing CRs from patch; use --binary to disable.)
> > > patching file include/linux/blk-mq.h
> > > Hunk #1 succeeded at 226 (offset -162 lines).
> > > (Stripping trailing CRs from patch; use --binary to disable.)
> > > patching file include/scsi/scsi_host.h
> > > patch unexpectedly ends in middle of line
> > > patch unexpectedly ends in middle of line
> > >
> > > Just to be sure I have to go on, is this correct? Sounds like an
> > error
> > > but I don't know if it is important.
> >
> > Looks there is small conflict, however it has been fixed by patch,
> > so
> > it is correct, please go on your test.
>
> Done, it still fails (2000 seconds or more to copy 1GB) :-(
>
> cat /sys/block/sdf/queue/scheduler outputs:
> [mq-deadline] none
>
> What to try next?
1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
note: replace $DISK with the disk name of your USB drive; for example, if
it is /dev/sdb, use sdb.
2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your 1GB copy
test again (a combined sketch follows below).
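A minimal sketch of the two steps together, with DISK as a placeholder for
your disk name (step 1 needs debugfs mounted and root privileges):

#!/bin/sh
DISK=sdf                                           # placeholder
cat /sys/kernel/debug/block/$DISK/hctx0/flags      # step 1: dump the hctx flags
echo 128 > /sys/block/$DISK/queue/nr_requests      # step 2: raise the queue depth
cat /sys/block/$DISK/queue/nr_requests             # confirm the new value took effect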
Thanks,
Ming
On Mon, 25/11/2019 at 23:15 +0800, Ming Lei wrote:
> On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
>
> [...]
>
> > What to try next?
>
> 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
result:
alloc_policy=FIFO SHOULD_MERGE|2
>
>
> 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your copy
> 1GB
> test again.
done, and still fails. What to try next?
Thanks,
Andrea
On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha scritto:
> > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> >
> > [...]
> >
> > > What to try next?
> >
> > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> result:
>
> alloc_policy=FIFO SHOULD_MERGE|2
>
> >
> >
> > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your copy
> > 1GB
> > test again.
>
> done, and still fails. What to try next?
I just ran a 256M cp test to one USB storage device on the patched kernel,
and the WRITE data IO is indeed in ascending order. The filesystem is ext4,
mounted without '-o sync'. From the previous discussion, that looks like
exactly your test setting. The order can be observed via the following script:
#!/bin/sh
# Trace every request issued to one device; pass the disk's major and
# minor numbers as the two arguments.
MAJ=$1
MIN=$2
# Block tracepoints encode dev_t as (major << 20) | minor.
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))
/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
$MAJ and $MIN can be retrieved via lsblk for your USB storage disk, for example:
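(assuming the script above is saved as trace_io.sh and lsblk reports the
whole-disk MAJ:MIN as 8:80; both names are placeholders)

lsblk -d -o NAME,MAJ:MIN /dev/sdf    # shows e.g. "sdf  8:80"
./trace_io.sh 8 80 > trace.log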
So I think we need to check first whether the patch is applied correctly.
If your kernel tree is managed via git, please post 'git diff'.
Otherwise, tell us your kernel version, and I will send you a patch
backported to that version.
Meanwhile, you can collect the IO order log via the above script as you did
last time, then send us the log.
Thanks,
Ming
On Tue, 26/11/2019 at 10:32 +0800, Ming Lei wrote:
> On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha scritto:
> > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > >
> > > [...]
> > >
> > > > What to try next?
> > >
> > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > result:
> >
> > alloc_policy=FIFO SHOULD_MERGE|2
> >
> > >
> > >
> > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your
> copy
> > > 1GB
> > > test again.
> >
> > done, and still fails. What to try next?
>
> I just run 256M cp test
I would like to point out that 256MB is a file size that usually doesn't
trigger the issue (I don't know if it matters, sorry).
Another strange behavior I noticed: yesterday I ran the test twice (as
usual with a 1GB file) and it took 2370s and 1786s, and a third test was
in progress when I stopped it. Then I started another set of 100 trials
and let them run overnight: the first 10 trials were around 1000s, then
the times gradually decreased to ~300s, and finally settled around 200s
with some trials below 70-80s. That is to say, the times are extremely
variable, and for the first time I noticed a sort of "performance
increase" over time.
> to one USB storage device on patched kernel,
> and WRITE data IO is really in ascending order. The filesystem is
> ext4,
> and mount without '-o sync'. From previous discussion, looks that is
> exactly your test setting. The order can be observed via the
> following script:
>
> #!/bin/sh
> MAJ=$1
> MIN=$2
> MAJ=$(( $MAJ << 20 ))
> DEV=$(( $MAJ | $MIN ))
> /usr/share/bcc/tools/trace -t -C \
> 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> >rwbs, args->sector, args->nr_sector'
>
> $MAJ & $MIN can be retrieved via lsblk for your USB storage disk.
>
> So I think we need to check if the patch is applied correctly first.
>
> If your kernel tree is managed via git,
yes it is,
> please post 'git diff'.
attached. Is it correctly patched? thanks.
> Otherwise, share us your kernel version,
by the way, it is 5.4.0+
> and I will send you one
> backported patch on the kernel version.
>
> Meantime, you can collect IO order log via the above script as you
> did last
> time, then send us the log.
ok, I will try; is it enough to run it for a short period of time (say, a
few seconds) during the copy, or should I start it before the beginning
(or before the mount?) and terminate it after the end of the copy? (Please
note that in the latter case a large amount of time (and data, I suppose)
would be involved, because, as said, to be sure the problem triggers I
have to use a large file... but we can try to better understand and tune
this. If it helps, you can get an ods file with the complete statistics
at [1] (look at the "prove_nov19" sheet).)
Thanks,
Andrea
[1]: http://fisica.unipv.it/transfer/kernelstats.zip
On Tue, Nov 26, 2019 at 08:46:07AM +0100, Andrea Vai wrote:
> Il giorno mar, 26/11/2019 alle 10.32 +0800, Ming Lei ha scritto:
> > On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha scritto:
> > > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > > >
> > > > [...]
> > > >
> > > > > What to try next?
> > > >
> > > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > > result:
> > >
> > > alloc_policy=FIFO SHOULD_MERGE|2
> > >
> > > >
> > > >
> > > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your
> > copy
> > > > 1GB
> > > > test again.
> > >
> > > done, and still fails. What to try next?
> >
> > I just run 256M cp test
>
> I would like to point out that 256MB is a filesize that usually don't
> trigger the issue (don't know if it matters, sorry).
OK.
I tested 256M because an IO timeout is often triggered with qemu-ehci,
and that is a long-standing issue. When the disk is set up via xhci-qemu,
the max request size increases to 1MB from 120KB, and the IO pattern
changes too. When the disk is connected via uhci-qemu, the transfer is too
slow (1MB/s) because the max endpoint size is too small.
However, I just waited 16 minutes and collected the whole 1GB IO log with
the disk connected over uhci-qemu, and the sector of each data IO is still
in order.
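For reference, a rough sketch of how such a disk can be attached in qemu
(the image paths and sizes are placeholders; swap qemu-xhci for usb-ehci
or piix3-usb-uhci to compare host controllers):

qemu-system-x86_64 -enable-kvm -m 2G -smp 4 \
    -drive file=guest.qcow2,if=virtio \
    -device qemu-xhci,id=xhci \
    -drive if=none,id=stick,format=raw,file=usbdisk.img \
    -device usb-storage,bus=xhci.0,drive=stick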
>
> Another info I would provide is about another strange behavior I
> noticed: yesterday I ran the test two times (as usual with 1GB
> filesize) and took 2370s, 1786s, and a third test was going on when I
> stopped it. Then I started another set of 100 trials and let them run
> tonight, and the first 10 trials were around 1000s, then gradually
> decreased to ~300s, and finally settled around 200s with some trials
> below 70-80s. This to say, times are extremely variable and for the
> first time I noticed a sort of "performance increase" with time.
The 'cp' test is buffered IO; can you reproduce the slowdown every time by
running the copy just after a fresh mount of the USB disk?
>
> > to one USB storage device on patched kernel,
> > and WRITE data IO is really in ascending order. The filesystem is
> > ext4,
> > and mount without '-o sync'. From previous discussion, looks that is
> > exactly your test setting. The order can be observed via the
> > following script:
> >
> > #!/bin/sh
> > MAJ=$1
> > MIN=$2
> > MAJ=$(( $MAJ << 20 ))
> > DEV=$(( $MAJ | $MIN ))
> > /usr/share/bcc/tools/trace -t -C \
> > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> > >rwbs, args->sector, args->nr_sector'
> >
> > $MAJ & $MIN can be retrieved via lsblk for your USB storage disk.
> >
> > So I think we need to check if the patch is applied correctly first.
> >
> > If your kernel tree is managed via git,
> yes it is,
>
> > please post 'git diff'.
> attached. Is it correctly patched? thanks.
Yeah, it should be correct, except that the change to
__blk_mq_delay_run_hw_queue() is duplicated.
>
>
> > Otherwise, share us your kernel version,
> btw, is 5.4.0+
>
> > and I will send you one
> > backported patch on the kernel version.
> >
> > Meantime, you can collect IO order log via the above script as you
> > did last
> > time, then send us the log.
>
> ok, will try; is it just required to run it for a short period of time
> (say, some seconds) during the copy, or should I run it before the
> beginning (or before the mount?), and terminate it after the end of
> the copy? (Please note that in the latter case a large amount of time
> (and data, I suppose) would be involved, because, as said, to be sure
> the problem triggers I have to use a large file... but we can try to
> better understand and tune this. If it can help, you can get an ods
> file with the complete statistic at [1] (look at the "prove_nov19"
> sheet)).
The data won't be very big: each line covers 120KB, and ~10K lines are
enough to cover a 1GB transfer, so a ~300KB compressed file should hold
the whole trace.
Thanks,
Ming
On Tue, Nov 26, 2019 at 05:15:33PM +0800, Ming Lei wrote:
> On Tue, Nov 26, 2019 at 08:46:07AM +0100, Andrea Vai wrote:
> > Il giorno mar, 26/11/2019 alle 10.32 +0800, Ming Lei ha scritto:
> > > On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > > > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha scritto:
> > > > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > What to try next?
> > > > >
> > > > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > > > result:
> > > >
> > > > alloc_policy=FIFO SHOULD_MERGE|2
> > > >
> > > > >
> > > > >
> > > > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your
> > > copy
> > > > > 1GB
> > > > > test again.
> > > >
> > > > done, and still fails. What to try next?
> > >
> > > I just run 256M cp test
> >
> > I would like to point out that 256MB is a filesize that usually don't
> > trigger the issue (don't know if it matters, sorry).
>
> OK.
>
> I tested 256M because IO timeout is often triggered in case of
> qemu-ehci, and it is a long-term issue. When setting up the disk
> via xhci-qemu, the max request size is increased to 1MB from 120KB,
> and IO pattern changed too. When the disk is connected via uhci-qemu,
> the transfer is too slow(1MB/s) because max endpoint size is too small.
>
> However, I just waited 16min and collected all the 1GB IO log by
> connecting disk over uhci-qemu, but the sector of each data IO
> is still in order.
>
> >
> > Another info I would provide is about another strange behavior I
> > noticed: yesterday I ran the test two times (as usual with 1GB
> > filesize) and took 2370s, 1786s, and a third test was going on when I
> > stopped it. Then I started another set of 100 trials and let them run
> > tonight, and the first 10 trials were around 1000s, then gradually
> > decreased to ~300s, and finally settled around 200s with some trials
> > below 70-80s. This to say, times are extremely variable and for the
> > first time I noticed a sort of "performance increase" with time.
>
> The 'cp' test is buffered IO, can you reproduce it every time by
> running copy just after fresh mount on the USB disk?
>
> >
> > > to one USB storage device on patched kernel,
> > > and WRITE data IO is really in ascending order. The filesystem is
> > > ext4,
> > > and mount without '-o sync'. From previous discussion, looks that is
> > > exactly your test setting. The order can be observed via the
> > > following script:
> > >
> > > #!/bin/sh
> > > MAJ=$1
> > > MIN=$2
> > > MAJ=$(( $MAJ << 20 ))
> > > DEV=$(( $MAJ | $MIN ))
> > > /usr/share/bcc/tools/trace -t -C \
> > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> > > >rwbs, args->sector, args->nr_sector'
> > >
> > > $MAJ & $MIN can be retrieved via lsblk for your USB storage disk.
> > >
> > > So I think we need to check if the patch is applied correctly first.
> > >
> > > If your kernel tree is managed via git,
> > yes it is,
> >
> > > please post 'git diff'.
> > attached. Is it correctly patched? thanks.
>
> Yeah, it should be correct except for the change on __blk_mq_delay_run_hw_queue()
> is duplicated.
>
> >
> >
> > > Otherwise, share us your kernel version,
> > btw, is 5.4.0+
> >
> > > and I will send you one
> > > backported patch on the kernel version.
> > >
> > > Meantime, you can collect IO order log via the above script as you
> > > did last
> > > time, then send us the log.
> >
> > ok, will try; is it just required to run it for a short period of time
> > (say, some seconds) during the copy, or should I run it before the
> > beginning (or before the mount?), and terminate it after the end of
> > the copy? (Please note that in the latter case a large amount of time
> > (and data, I suppose) would be involved, because, as said, to be sure
> > the problem triggers I have to use a large file... but we can try to
> > better understand and tune this. If it can help, you can get an ods
> > file with the complete statistic at [1] (look at the "prove_nov19"
> > sheet)).
>
> The data won't be very big, each line covers 120KB, and ~10K line
> is enough for cover 1GB transfer. Then ~300KB compressed file should
> hold all the trace.
Also use the following trace script this time:
#!/bin/sh
MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))
# Trace both when a request is inserted into the queue and when it is
# issued to the device, so insertion order can be compared with dispatch order.
/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
Thanks,
Ming
On Tue, 26/11/2019 at 17:15 +0800, Ming Lei wrote:
> On Tue, Nov 26, 2019 at 08:46:07AM +0100, Andrea Vai wrote:
> > Il giorno mar, 26/11/2019 alle 10.32 +0800, Ming Lei ha scritto:
> > > On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > > > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha
> scritto:
> > > > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > What to try next?
> > > > >
> > > > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > > > result:
> > > >
> > > > alloc_policy=FIFO SHOULD_MERGE|2
> > > >
> > > > >
> > > > >
> > > > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run
> your
> > > copy
> > > > > 1GB
> > > > > test again.
> > > >
> > > > done, and still fails. What to try next?
> > >
> > > I just run 256M cp test
> >
> > I would like to point out that 256MB is a filesize that usually
> don't
> > trigger the issue (don't know if it matters, sorry).
>
> OK.
>
> I tested 256M because IO timeout is often triggered in case of
> qemu-ehci, and it is a long-term issue. When setting up the disk
> via xhci-qemu, the max request size is increased to 1MB from 120KB,
> and IO pattern changed too. When the disk is connected via uhci-
> qemu,
> the transfer is too slow(1MB/s) because max endpoint size is too
> small.
>
> However, I just waited 16min and collected all the 1GB IO log by
> connecting disk over uhci-qemu, but the sector of each data IO
> is still in order.
>
> >
> > Another info I would provide is about another strange behavior I
> > noticed: yesterday I ran the test two times (as usual with 1GB
> > filesize) and took 2370s, 1786s, and a third test was going on
> when I
> > stopped it. Then I started another set of 100 trials and let them
> run
> > tonight, and the first 10 trials were around 1000s, then gradually
> > decreased to ~300s, and finally settled around 200s with some
> trials
> > below 70-80s. This to say, times are extremely variable and for
> the
> > first time I noticed a sort of "performance increase" with time.
>
> The 'cp' test is buffered IO, can you reproduce it every time by
> running copy just after fresh mount on the USB disk?
yes, every time my test script (attached) mounts, copies, and unmounts
(but I don't unplug and replug the pendrive each time). Is this enough?
>
> >
> > > to one USB storage device on patched kernel,
> > > and WRITE data IO is really in ascending order. The filesystem
> is
> > > ext4,
> > > and mount without '-o sync'. From previous discussion, looks
> that is
> > > exactly your test setting. The order can be observed via the
> > > following script:
> > >
> > > #!/bin/sh
> > > MAJ=$1
> > > MIN=$2
> > > MAJ=$(( $MAJ << 20 ))
> > > DEV=$(( $MAJ | $MIN ))
> > > /usr/share/bcc/tools/trace -t -C \
> > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d",
> args-
> > > >rwbs, args->sector, args->nr_sector'
> > >
> > > $MAJ & $MIN can be retrieved via lsblk for your USB storage
> disk.
ok, so I try:
# lsblk /dev/sdf
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 1 28,8G 0 disk
└─sdf1 8:81 1 28,8G 0 part
so I ran your script (the second one, which you sent me in the next
email message) with:
./test_ming 8 80
but it fails to run (terminal output is in attached errors.txt).
What am I doing wrong?
It's still not clear to me whether I need to start the trace script and
then the test, or the other way round (or whether it doesn't matter). The
above errors are from the former case (I didn't even start the test,
actually).
Thanks,
Andrea
On Tue, 26 Nov 2019, Andrea Vai wrote:
> Then I started another set of 100 trials and let them run tonight, and
> the first 10 trials were around 1000s, then gradually decreased to
> ~300s, and finally settled around 200s with some trials below 70-80s.
> This to say, times are extremely variable and for the first time I
> noticed a sort of "performance increase" with time.
>
The sheer volume of testing (probably some terabytes by now) would
exercise the wear leveling algorithm in the FTL.
This in itself seems unlikely to improve performance significantly. But if
the flash memory came from a bad batch, perhaps it would have that effect.
To find out, someone may need to source another (genuine) Kingston
DataTraveller device.
--
On Tue, Nov 26, 2019 at 12:14:19PM +0100, Andrea Vai wrote:
> Il giorno mar, 26/11/2019 alle 17.15 +0800, Ming Lei ha scritto:
> > On Tue, Nov 26, 2019 at 08:46:07AM +0100, Andrea Vai wrote:
> > > Il giorno mar, 26/11/2019 alle 10.32 +0800, Ming Lei ha scritto:
> > > > On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > > > > Il giorno lun, 25/11/2019 alle 23.15 +0800, Ming Lei ha
> > scritto:
> > > > > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > What to try next?
> > > > > >
> > > > > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> > > > > result:
> > > > >
> > > > > alloc_policy=FIFO SHOULD_MERGE|2
> > > > >
> > > > > >
> > > > > >
> > > > > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run
> > your
> > > > copy
> > > > > > 1GB
> > > > > > test again.
> > > > >
> > > > > done, and still fails. What to try next?
> > > >
> > > > I just run 256M cp test
> > >
> > > I would like to point out that 256MB is a filesize that usually
> > don't
> > > trigger the issue (don't know if it matters, sorry).
> >
> > OK.
> >
> > I tested 256M because IO timeout is often triggered in case of
> > qemu-ehci, and it is a long-term issue. When setting up the disk
> > via xhci-qemu, the max request size is increased to 1MB from 120KB,
> > and IO pattern changed too. When the disk is connected via uhci-
> > qemu,
> > the transfer is too slow(1MB/s) because max endpoint size is too
> > small.
> >
> > However, I just waited 16min and collected all the 1GB IO log by
> > connecting disk over uhci-qemu, but the sector of each data IO
> > is still in order.
> >
> > >
> > > Another info I would provide is about another strange behavior I
> > > noticed: yesterday I ran the test two times (as usual with 1GB
> > > filesize) and took 2370s, 1786s, and a third test was going on
> > when I
> > > stopped it. Then I started another set of 100 trials and let them
> > run
> > > tonight, and the first 10 trials were around 1000s, then gradually
> > > decreased to ~300s, and finally settled around 200s with some
> > trials
> > > below 70-80s. This to say, times are extremely variable and for
> > the
> > > first time I noticed a sort of "performance increase" with time.
> >
> > The 'cp' test is buffered IO, can you reproduce it every time by
> > running copy just after fresh mount on the USB disk?
>
> yes, every time my test script (attached) mounts, copy, unmount (but I
> don't unplug and replug the pendrive each time). Is this enough?
>
> >
> > >
> > > > to one USB storage device on patched kernel,
> > > > and WRITE data IO is really in ascending order. The filesystem
> > is
> > > > ext4,
> > > > and mount without '-o sync'. From previous discussion, looks
> > that is
> > > > exactly your test setting. The order can be observed via the
> > > > following script:
> > > >
> > > > #!/bin/sh
> > > > MAJ=$1
> > > > MIN=$2
> > > > MAJ=$(( $MAJ << 20 ))
> > > > DEV=$(( $MAJ | $MIN ))
> > > > /usr/share/bcc/tools/trace -t -C \
> > > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d",
> > args-
> > > > >rwbs, args->sector, args->nr_sector'
> > > >
> > > > $MAJ & $MIN can be retrieved via lsblk for your USB storage
> > disk.
>
> ok, so I try:
>
> # lsblk /dev/sdf
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sdf 8:80 1 28,8G 0 disk
> └─sdf1 8:81 1 28,8G 0 part
>
> so I ran your script (the second one, which you sent me in the next
> email message) with:
>
> ./test_ming 8 80
>
> but it fails to run (terminal output is in attached errors.txt).
> What am I doing wrong?
>
> It's still not clear to me if I need to start the trace script and
> then the test, or the opposite (or doesn't matter). The above errors
> are in the former case (I didn't even start the test, actually)
>
> Thanks,
> Andrea
> In file included from /virtual/main.c:2:
> In file included from /lib/modules/5.4.0+/build/include/linux/ptrace.h:6:
> In file included from /lib/modules/5.4.0+/build/include/linux/sched.h:14:
> In file included from /lib/modules/5.4.0+/build/include/linux/pid.h:5:
> In file included from /lib/modules/5.4.0+/build/include/linux/rculist.h:11:
> In file included from /lib/modules/5.4.0+/build/include/linux/rcupdate.h:26:
> In file included from /lib/modules/5.4.0+/build/include/linux/irqflags.h:16:
> In file included from /lib/modules/5.4.0+/build/arch/x86/include/asm/irqflags.h:9:
> In file included from /lib/modules/5.4.0+/build/arch/x86/include/asm/nospec-branch.h:314:
> /lib/modules/5.4.0+/build/arch/x86/include/asm/segment.h:254:2: error: expected '(' after 'asm'
> alternative_io ("lsl %[seg],%[p]",
> ^
> /lib/modules/5.4.0+/build/arch/x86/include/asm/alternative.h:240:2: note: expanded from macro 'alternative_io'
> asm_inline volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
> ^
> /lib/modules/5.4.0+/build/include/linux/compiler_types.h:210:24: note: expanded from macro 'asm_inline'
> #define asm_inline asm __inline
> ^
> In file included from /virtual/main.c:2:
> In file included from /lib/modules/5.4.0+/build/include/linux/ptrace.h:6:
> In file included from /lib/modules/5.4.0+/build/include/linux/sched.h:14:
> In file included from /lib/modules/5.4.0+/build/include/linux/pid.h:5:
> In file included from /lib/modules/5.4.0+/build/include/linux/rculist.h:11:
> In file included from /lib/modules/5.4.0+/build/include/linux/rcupdate.h:27:
> In file included from /lib/modules/5.4.0+/build/include/linux/preempt.h:78:
> In file included from /lib/modules/5.4.0+/build/arch/x86/include/asm/preempt.h:7:
> In file included from /lib/modules/5.4.0+/build/include/linux/thread_info.h:38:
> In file included from /lib/modules/5.4.0+/build/arch/x86/include/asm/thread_info.h:12:
> In file included from /lib/modules/5.4.0+/build/arch/x86/include/asm/page.h:12:
> /lib/modules/5.4.0+/build/arch/x86/include/asm/page_64.h:49:2: error: expected '(' after 'asm'
> alternative_call_2(clear_page_orig,
> ^
> /lib/modules/5.4.0+/build/arch/x86/include/asm/alternative.h:256:2: note: expanded from macro 'alternative_call_2'
> asm_inline volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", feature1,\
> ^
> /lib/modules/5.4.0+/build/include/linux/compiler_types.h:210:24: note: expanded from macro 'asm_inline'
> #define asm_inline asm __inline
It can be worked around via the following change in
/lib/modules/5.4.0+/build/include/generated/autoconf.h (comment out the define):
//#define CONFIG_CC_HAS_ASM_INLINE 1
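A one-liner sketch of that edit (keep a backup; the file is regenerated on
the next kernel build, and the path assumes the 5.4.0+ tree from the error
log above):

sed -i.bak 's|^#define CONFIG_CC_HAS_ASM_INLINE 1|//#define CONFIG_CC_HAS_ASM_INLINE 1|' \
    /lib/modules/5.4.0+/build/include/generated/autoconf.h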
Thanks,
Ming
>
> > Then I started another set of 100 trials and let them run tonight, and
> > the first 10 trials were around 1000s, then gradually decreased to
> > ~300s, and finally settled around 200s with some trials below 70-80s.
> > This to say, times are extremely variable and for the first time I
> > noticed a sort of "performance increase" with time.
> >
>
> The sheer volume of testing (probably some terabytes by now) would
> exercise the wear leveling algorithm in the FTL.
>
But with "old kernel" the copy operation still is "fast", as far as i understood.
If FTL (e.g. wear leveling) would slow down, we would see that also in
the old kernel, right?
Andrea, can you confirm that the same device used with the old fast
kernel is still fast today?
BR
Carsten
On Wed, 27/11/2019 at 10:05 +0800, Ming Lei wrote:
>
>
> It can be workaround via the following change:
>
> /lib/modules/5.4.0+/build/include/generated/autoconf.h:
>
> //#define CONFIG_CC_HAS_ASM_INLINE 1
Thanks, it worked, trace attached. Procedure: start the trace script
(with the pendrive already plugged in), wait a few seconds, run the test
(1 trial, 1 GB), wait for the test to finish, stop the trace.
The copy took 2659 seconds, roughly in line with what I had already seen.
Thanks,
Andrea
On Wed, Nov 27, 2019 at 10:39:40AM +0100, Andrea Vai wrote:
> Il giorno mer, 27/11/2019 alle 10.05 +0800, Ming Lei ha scritto:
> >
> >
> > It can be workaround via the following change:
> >
> > /lib/modules/5.4.0+/build/include/generated/autoconf.h:
> >
> > //#define CONFIG_CC_HAS_ASM_INLINE 1
>
> Thanks, it worked, trace attached. Produced by: start the trace script
> (with the pendrive already plugged), wait some seconds, run the test
> (1 trial, 1 GB), wait for the test to finish, stop the trace.
>
> The copy took 2659 seconds, roughly as already seen before.
Thanks for collecting the log.
From the log, some of the write IOs are out of order; for example, the
first one is sector 378880.
16.41240 2 266 266 kworker/2:1H block_rq_issue b'W' 370656 240
16.41961 3 485 485 kworker/3:1H block_rq_issue b'W' 378880 240
16.73729 2 266 266 kworker/2:1H block_rq_issue b'W' 370896 240
17.71161 2 266 266 kworker/2:1H block_rq_issue b'W' 379120 240
18.02344 2 266 266 kworker/2:1H block_rq_issue b'W' 371136 240
18.94314 3 485 485 kworker/3:1H block_rq_issue b'W' 379360 240
19.25624 2 266 266 kworker/2:1H block_rq_issue b'W' 371376 240
IO latency increases a lot starting from the first out-of-order request
(the USB storage HBA has a queue depth of one, so a request can be issued
only after the previously issued request completes).
The reason is that there are two kinds of tasks inserting requests for the
device: one is the 'cp' process, the other is kworker/u8:*. The
out-of-order pattern shows up where the two tasks interleave.
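A small sketch for spotting these in the trace file, assuming the column
layout of the excerpt above (field 7 is the rwbs flags, field 8 the start
sector) and trace.log as a placeholder file name:

awk '$6 == "block_rq_issue" && $7 ~ /W/ {
        if (prev != "" && $8 + 0 < prev + 0)
            printf "out-of-order at %s: sector %s issued after %s\n", $1, $8, prev
        prev = $8
     }' trace.log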
In such a situation, I believe the old legacy IO path may not guarantee
the order either. In blk_queue_bio(), after get_request() allocates a
request, the queue lock is released, and the request is actually inserted
and issued from blk_flush_plug_list() under the 'if (plug)' branch. If
requests come from two tasks, they are inserted/issued from two plug
lists, and no order can be guaranteed.
In my test, except for several requests at the beginning, all other
requests are inserted via the kworker thread (I guess it is the writeback
workqueue); that is why I can't observe the issue in my test.
As Schmid suggested, you may run the same test on the old kernel with the
legacy IO path and see whether the performance is still good.
Also, could you share the following info about your machine, so that I can
set up my VM guest the same way and reproduce your situation (requests
inserted from two types of threads)?
- lscpu
- free -h
- lsblk -d $USB_DISK
- exact commands for mount the disk, and running the copy operation
Thanks,
Ming
On Wed, 27/11/2019 at 21:08 +0800, Ming Lei wrote:
> On Wed, Nov 27, 2019 at 10:39:40AM +0100, Andrea Vai wrote:
> > Il giorno mer, 27/11/2019 alle 10.05 +0800, Ming Lei ha scritto:
> > >
> > >
> > > It can be workaround via the following change:
> > >
> > > /lib/modules/5.4.0+/build/include/generated/autoconf.h:
> > >
> > > //#define CONFIG_CC_HAS_ASM_INLINE 1
> >
> > Thanks, it worked, trace attached. Produced by: start the trace
> script
> > (with the pendrive already plugged), wait some seconds, run the
> test
> > (1 trial, 1 GB), wait for the test to finish, stop the trace.
> >
> > The copy took 2659 seconds, roughly as already seen before.
>
> Thanks for collecting the log.
>
> From the log, some of write IOs are out-of-order, such as, the 1st
> one
> is 378880.
>
> 16.41240 2 266 266 kworker/2:1H block_rq_issue b'W'
> 370656 240
> 16.41961 3 485 485 kworker/3:1H block_rq_issue b'W'
> 378880 240
> 16.73729 2 266 266 kworker/2:1H block_rq_issue b'W'
> 370896 240
> 17.71161 2 266 266 kworker/2:1H block_rq_issue b'W'
> 379120 240
> 18.02344 2 266 266 kworker/2:1H block_rq_issue b'W'
> 371136 240
> 18.94314 3 485 485 kworker/3:1H block_rq_issue b'W'
> 379360 240
> 19.25624 2 266 266 kworker/2:1H block_rq_issue b'W'
> 371376 240
>
> IO latency is increased a lot since the 1st out-of-order request(usb
> storage HBA is single queue depth, one request can be issued only
> if
> the previous issued request is completed).
>
> The reason is that there are two kind of tasks which inserts rq to
> device.
> One is the 'cp' process, the other is kworker/u8:*. The out-of-
> order
> happens during the two task's interleaving.
>
> Under such situation, I believe that the old legacy IO path may not
> guarantee the order too. In blk_queue_bio(), after get_request()
> allocates one request, the queue lock is released. And request is
> actually inserted & issued from blk_flush_plug_list() under the
> branch of 'if (plug)'. If requests are from two tasks, then request
> is inserted/issued from two plug list, and no order can be
> guaranteed.
>
> In my test, except for several requests from the beginning, all
> other
> requests are inserted via the kworker thread(guess it is writeback
> wq),
> that is why I can't observe the issue in my test.
>
> As Schmid suggested, you may run the same test on old kernel with
> legacy io path, and see if the performance is still good.
>
> Also, could you share the following info about your machine? So that
> I can build my VM guest in this setting for reproducing your
> situation
> (requests are inserted from two types of threads).
>
> - lscpu
attached,
> - free -h
total used free shared buff/cache available
Mem: 23Gi 4,2Gi 11Gi 448Mi 7,0Gi 18Gi
Swap: 3,7Gi 0B 3,7Gi
> - lsblk -d $USB_DISK
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdg 8:96 1 28,8G 0 disk
> - exact commands for mount the disk, and running the copy operation
I already attached the whole script earlier in this thread; I attach it
again to this message and copy the relevant lines here:
mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
SECONDS=0
cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
umount /mnt/pendrive 2>&1 |tee -a $logfile
Meanwhile, I am going on with the further tests as suggested
Thanks,
Andrea
On Wed, 27 Nov 2019, Schmid, Carsten wrote:
> >
> > The sheer volume of testing (probably some terabytes by now) would
> > exercise the wear leveling algorithm in the FTL.
> >
> But with "old kernel" the copy operation still is "fast", as far as i
> understood. If FTL (e.g. wear leveling) would slow down, we would see
> that also in the old kernel, right?
>
> Andrea, can you confirm that the same device used with the old fast
> kernel is still fast today?
You seem to be saying we should optimize the kernel for a pathological
use-case merely because it used to be fast before the blk-mq conversion.
That makes no sense to me. I suppose you have information that I don't.
I assume that your employer (and the other corporations involved in this)
have plenty of regression test results from a variety of flash hardware to
show that the regression is real and the device is not pathological.
I'm not privy to any of that information so I will shut up and leave you
guys to it.
--
> > This in itself seems unlikely to improve performance significantly.
> > But if the flash memory came from a bad batch, perhaps it would have
> > that effect.
> >
> > To find out, someone may need to source another (genuine) Kingston
> > DataTraveller device.
> >
On Wed, 27/11/2019 at 08:14 +0000, Schmid, Carsten wrote:
> >
> > > Then I started another set of 100 trials and let them run
> tonight, and
> > > the first 10 trials were around 1000s, then gradually decreased
> to
> > > ~300s, and finally settled around 200s with some trials below
> 70-80s.
> > > This to say, times are extremely variable and for the first time
> I
> > > noticed a sort of "performance increase" with time.
> > >
> >
> > The sheer volume of testing (probably some terabytes by now) would
> > exercise the wear leveling algorithm in the FTL.
> >
> But with "old kernel" the copy operation still is "fast", as far as
> i understood.
> If FTL (e.g. wear leveling) would slow down, we would see that also
> in
> the old kernel, right?
>
> Andrea, can you confirm that the same device used with the old fast
> kernel is still fast today?
Yes, it is still fast. Just ran a 100 trials test and got an average
of 70 seconds with standard deviation = 6 seconds, aligned with the
past values of the same kernel.
Thanks,
Andrea
> > > The sheer volume of testing (probably some terabytes by now) would
> > > exercise the wear leveling algorithm in the FTL.
> > >
> > But with "old kernel" the copy operation still is "fast", as far as
> > i understood.
> > If FTL (e.g. wear leveling) would slow down, we would see that also
> > in
> > the old kernel, right?
> >
> > Andrea, can you confirm that the same device used with the old fast
> > kernel is still fast today?
>
> Yes, it is still fast. Just ran a 100 trials test and got an average
> of 70 seconds with standard deviation = 6 seconds, aligned with the
> past values of the same kernel.
>
> Thanks,
> Andrea
I have been involved in several benchmarking efforts on flash devices in
the past, so what we see here is definitely not a device issue related to
wear leveling. I wanted to keep all of you from going in the wrong
direction; that's why I wanted Andrea to confirm that it's not a matter
of the flash device.
There are many factors involved in benchmarking flash devices, but I have
never seen slowdowns by a factor of 10-30 like the ones Andrea observes.
I assume the only thing you change between the benchmarks is the kernel
(and the modules, of course), right, Andrea? Then we can rule out cache
settings, which can massively impact benchmarks.
The only approach that makes sense from my POV is:
- collect traces with the kernel before the mentioned commit (fast)
- apply the patch in doubt
- collect traces again (slow)
- compare the traces
Then we should be able to see the difference(s).
Unfortunately I'm not an expert on the SCSI and USB kernel code involved
here, otherwise I would try to understand what happens and give you some
hints.
BR
Carsten
On Thu, Nov 28, 2019 at 08:46:57AM +0100, Andrea Vai wrote:
> Il giorno mer, 27/11/2019 alle 08.14 +0000, Schmid, Carsten ha
> scritto:
> > >
> > > > Then I started another set of 100 trials and let them run
> > tonight, and
> > > > the first 10 trials were around 1000s, then gradually decreased
> > to
> > > > ~300s, and finally settled around 200s with some trials below
> > 70-80s.
> > > > This to say, times are extremely variable and for the first time
> > I
> > > > noticed a sort of "performance increase" with time.
> > > >
> > >
> > > The sheer volume of testing (probably some terabytes by now) would
> > > exercise the wear leveling algorithm in the FTL.
> > >
> > But with "old kernel" the copy operation still is "fast", as far as
> > i understood.
> > If FTL (e.g. wear leveling) would slow down, we would see that also
> > in
> > the old kernel, right?
> >
> > Andrea, can you confirm that the same device used with the old fast
> > kernel is still fast today?
>
> Yes, it is still fast. Just ran a 100 trials test and got an average
> of 70 seconds with standard deviation = 6 seconds, aligned with the
> past values of the same kernel.
Then can you collect a trace on the old kernel via the previous script?
#!/bin/sh
MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))
/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
Both trace points and bcc should be available on the old kernel.
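For example, assuming the pendrive shows up as /dev/sdh and the script is
saved as trace.sh (both names are just examples), it can be invoked as:

cat /sys/block/sdh/dev        # prints the major:minor pair, e.g. 8:112
sh trace.sh 8 112 > trace.log # pass major and minor as the two arguments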
Thanks,
Ming
Il giorno gio, 28/11/2019 alle 08.12 +0000, Schmid, Carsten ha
scritto:
>
> [...]
>
> I assume the only thing that you change between the benchmarks
> is the kernel (and the modules, of course), right, Andrea?
>
It's my production machine, so apart from the changes involved in the
"normal use of a PC" I can say that there are no changes I am aware of
(apart from the kernel, and the other changes you asked me to make, such
as changing the IO scheduler, etc.)... but please remember I am not an
expert, so feel free to ask me about any other kind of changes I could
tell you about.
Thanks,
Andrea
Il giorno mer, 27/11/2019 alle 11.21 +1100, Finn Thain ha scritto:
> On Tue, 26 Nov 2019, Andrea Vai wrote:
>
> > Then I started another set of 100 trials and let them run tonight,
> and
> > the first 10 trials were around 1000s, then gradually decreased
> to
> > ~300s, and finally settled around 200s with some trials below 70-
> 80s.
> > This to say, times are extremely variable and for the first time
> I
> > noticed a sort of "performance increase" with time.
> >
>
> The sheer volume of testing (probably some terabytes by now) would
> exercise the wear leveling algorithm in the FTL.
>
> This in itself seems unlikely to improve performance significantly.
> But if
> the flash memory came from a bad batch, perhaps it would have that
> effect.
>
> To find out, someone may need to source another (genuine) Kingston
> DataTraveller device.
I own another device (let's refer to it as "black odd"), identical to
the "slow" one (call it "black even"), and I used it as well for the
tests, especially at the beginning of this story, because I suspected
the problem could be related to a faulty pen drive. At some point
I realized that the tests I performed didn't show any difference
between the two flash drives, so since then I have kept using just the
"black even". They were bought together, so of course both of them
probably belong to the same "maybe-bad batch".
But I have another Kingston DataTraveler ("White"), externally
slightly different from the other two (it's white instead of black,
and labeled G4 instead of G3), though lsusb shows the same IDs:
0951:1666. It was purchased some months after the other two
(well, actually, it may be the result of an RMA exchange).
I have just run a test on this White one, with the new (patched)
kernel, and it took an average of 200 seconds (st. dev. = 46 s), which is
not "good", but less "bad" than the really "bad" case of the "black"
ones (>1000 seconds).
I have also tried the White one with the old fast kernel, and the
behavior is almost the same as with the new kernel, though a little
bit better (mean = 173 s; st. dev. = 11 s).
Feel free to let me know if I should do other tries,
thanks,
Andrea
Il giorno gio, 28/11/2019 alle 17.17 +0800, Ming Lei ha scritto:
> On Thu, Nov 28, 2019 at 08:46:57AM +0100, Andrea Vai wrote:
> > Il giorno mer, 27/11/2019 alle 08.14 +0000, Schmid, Carsten ha
> > scritto:
> > > >
> > > > > Then I started another set of 100 trials and let them run
> > > tonight, and
> > > > > the first 10 trials were around 1000s, then gradually
> decreased
> > > to
> > > > > ~300s, and finally settled around 200s with some trials
> below
> > > 70-80s.
> > > > > This to say, times are extremely variable and for the first
> time
> > > I
> > > > > noticed a sort of "performance increase" with time.
> > > > >
> > > >
> > > > The sheer volume of testing (probably some terabytes by now)
> would
> > > > exercise the wear leveling algorithm in the FTL.
> > > >
> > > But with "old kernel" the copy operation still is "fast", as far
> as
> > > i understood.
> > > If FTL (e.g. wear leveling) would slow down, we would see that
> also
> > > in
> > > the old kernel, right?
> > >
> > > Andrea, can you confirm that the same device used with the old
> fast
> > > kernel is still fast today?
> >
> > Yes, it is still fast. Just ran a 100 trials test and got an
> average
> > of 70 seconds with standard deviation = 6 seconds, aligned with
> the
> > past values of the same kernel.
>
> Then can you collect trace on the old kernel via the previous
> script?
>
> #!/bin/sh
>
> MAJ=$1
> MIN=$2
> MAJ=$(( $MAJ << 20 ))
> DEV=$(( $MAJ | $MIN ))
>
> /usr/share/bcc/tools/trace -t -C \
> 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> >rwbs, args->sector, args->nr_sector' \
> 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args-
> >rwbs, args->sector, args->nr_sector'
>
> Both the two trace points and bcc should be available on the old
> kernel.
>
Trace attached. Produced by: start the trace script
(with the pendrive already plugged in), wait a few seconds, run the test
(1 trial, 1 GB), wait for the test to finish, stop the trace.
The copy took 73 seconds, roughly as already seen before with the fast
old kernel.
Thanks,
Andrea
On Thu, 28 Nov 2019, Schmid, Carsten wrote:
> I have been involved in several benchmarkings of flash devices in the past.
> So what we see here is definitely not a device issue regarding wear leveling.
>
> I wanted to prevent all of you going into the wrong direction, that's why
> i wanted Andrea to confirm that it's not a matter of the flash device.
>
> There are so much items involved into benchmarking flash devices.
> But Andrea's observations with factors of 10-30 times slow down
> i have never seen before.
>
> I assume the only thing that you change between the benchmarks
> is the kernel (and the modules, of course), right, Andrea?
> Then we can rule out cache settings which massively can impact
> benchmarks.
>
> The only thing that makes sense from my POV is:
> - collect traces with the kernel before mentioned commit (fast)
> - apply patch in doubt
> - again collect traces (slow)
> - compare the traces
>
> Then we should be able to see the difference(s).
We have already done this. I forget whether the traces are in the
email history available in the archives or whether they are stored
somewhere else.
In any case, my analysis of the traces is in the archives. It seemed
very clear that the only difference which mattered was the ordering of
the write commands (sequential vs. non-sequential). This was obviously
something which the commit in question would affect, and it also seemed
likely to cause the device to slow down considerably.
Alan Stern
> Unfortunately i'm not an expert on the SCSI and USB kernel stuff
> involved here. Else i would try to understand what happens and
> give you some hints.
>
> BR
> Carsten
On Thu, Nov 28, 2019 at 06:34:32PM +0100, Andrea Vai wrote:
> Il giorno gio, 28/11/2019 alle 17.17 +0800, Ming Lei ha scritto:
> > On Thu, Nov 28, 2019 at 08:46:57AM +0100, Andrea Vai wrote:
> > > Il giorno mer, 27/11/2019 alle 08.14 +0000, Schmid, Carsten ha
> > > scritto:
> > > > >
> > > > > > Then I started another set of 100 trials and let them run
> > > > tonight, and
> > > > > > the first 10 trials were around 1000s, then gradually
> > decreased
> > > > to
> > > > > > ~300s, and finally settled around 200s with some trials
> > below
> > > > 70-80s.
> > > > > > This to say, times are extremely variable and for the first
> > time
> > > > I
> > > > > > noticed a sort of "performance increase" with time.
> > > > > >
> > > > >
> > > > > The sheer volume of testing (probably some terabytes by now)
> > would
> > > > > exercise the wear leveling algorithm in the FTL.
> > > > >
> > > > But with "old kernel" the copy operation still is "fast", as far
> > as
> > > > i understood.
> > > > If FTL (e.g. wear leveling) would slow down, we would see that
> > also
> > > > in
> > > > the old kernel, right?
> > > >
> > > > Andrea, can you confirm that the same device used with the old
> > fast
> > > > kernel is still fast today?
> > >
> > > Yes, it is still fast. Just ran a 100 trials test and got an
> > average
> > > of 70 seconds with standard deviation = 6 seconds, aligned with
> > the
> > > past values of the same kernel.
> >
> > Then can you collect trace on the old kernel via the previous
> > script?
> >
> > #!/bin/sh
> >
> > MAJ=$1
> > MIN=$2
> > MAJ=$(( $MAJ << 20 ))
> > DEV=$(( $MAJ | $MIN ))
> >
> > /usr/share/bcc/tools/trace -t -C \
> > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> > >rwbs, args->sector, args->nr_sector' \
> > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args-
> > >rwbs, args->sector, args->nr_sector'
> >
> > Both the two trace points and bcc should be available on the old
> > kernel.
> >
>
> Trace attached. Produced by: start the trace script
> (with the pendrive already plugged), wait some seconds, run the test
> (1 trial, 1 GB), wait for the test to finish, stop the trace.
>
> The copy took 73 seconds, roughly as already seen before with the fast
> old kernel.
This trace shows a good write IO order, because the writeback IOs are
queued to the block layer serially from the 'cp' task and the writeback wq.
However, the writeback IO order is changed in the current Linus tree,
because the IOs are queued to the block layer concurrently from the 'cp'
task and the writeback wq. It might be related to the killing of queue
congestion by blk-mq.
The performance effect may not be limited to this specific USB drive;
I guess it could show up on any HDD as well.
However, I still can't reproduce it in my VM, even though I built it
with settings similar to Andrea's test machine. Maybe the emulated disk
is just much faster than Andrea's.
Andrea, can you collect the following log when running the test
on current new(bad) kernel?
/usr/share/bcc/tools/stackcount -K blk_mq_make_request
Thanks,
Ming
On Fri, Nov 29, 2019 at 08:57:34AM +0800, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 06:34:32PM +0100, Andrea Vai wrote:
> > Il giorno gio, 28/11/2019 alle 17.17 +0800, Ming Lei ha scritto:
> > > On Thu, Nov 28, 2019 at 08:46:57AM +0100, Andrea Vai wrote:
> > > > Il giorno mer, 27/11/2019 alle 08.14 +0000, Schmid, Carsten ha
> > > > scritto:
> > > > > >
> > > > > > > Then I started another set of 100 trials and let them run
> > > > > tonight, and
> > > > > > > the first 10 trials were around 1000s, then gradually
> > > decreased
> > > > > to
> > > > > > > ~300s, and finally settled around 200s with some trials
> > > below
> > > > > 70-80s.
> > > > > > > This to say, times are extremely variable and for the first
> > > time
> > > > > I
> > > > > > > noticed a sort of "performance increase" with time.
> > > > > > >
> > > > > >
> > > > > > The sheer volume of testing (probably some terabytes by now)
> > > would
> > > > > > exercise the wear leveling algorithm in the FTL.
> > > > > >
> > > > > But with "old kernel" the copy operation still is "fast", as far
> > > as
> > > > > i understood.
> > > > > If FTL (e.g. wear leveling) would slow down, we would see that
> > > also
> > > > > in
> > > > > the old kernel, right?
> > > > >
> > > > > Andrea, can you confirm that the same device used with the old
> > > fast
> > > > > kernel is still fast today?
> > > >
> > > > Yes, it is still fast. Just ran a 100 trials test and got an
> > > average
> > > > of 70 seconds with standard deviation = 6 seconds, aligned with
> > > the
> > > > past values of the same kernel.
> > >
> > > Then can you collect trace on the old kernel via the previous
> > > script?
> > >
> > > #!/bin/sh
> > >
> > > MAJ=$1
> > > MIN=$2
> > > MAJ=$(( $MAJ << 20 ))
> > > DEV=$(( $MAJ | $MIN ))
> > >
> > > /usr/share/bcc/tools/trace -t -C \
> > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> > > >rwbs, args->sector, args->nr_sector' \
> > > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args-
> > > >rwbs, args->sector, args->nr_sector'
> > >
> > > Both the two trace points and bcc should be available on the old
> > > kernel.
> > >
> >
> > Trace attached. Produced by: start the trace script
> > (with the pendrive already plugged), wait some seconds, run the test
> > (1 trial, 1 GB), wait for the test to finish, stop the trace.
> >
> > The copy took 73 seconds, roughly as already seen before with the fast
> > old kernel.
>
> This trace shows a good write IO order because the writeback IOs are
> queued to block layer serially from the 'cp' task and writeback wq.
>
> However, writeback IO order is changed in current linus tree because
> the IOs are queued to block layer concurrently from the 'cp' task
> and writeback wq. It might be related with killing queue_congestion
> by blk-mq.
>
> The performance effect could be not only on this specific USB drive,
> but also on all HDD., I guess.
>
> However, I still can't reproduce it in my VM even though I built it
> with similar setting of Andrea's test machine. Maybe the emulated disk
> is too fast than Andrea's.
>
> Andrea, can you collect the following log when running the test
> on current new(bad) kernel?
>
> /usr/share/bcc/tools/stackcount -K blk_mq_make_request
Instead, please run the following trace, given that insert may be
called from other paths, such as plug flushing:
/usr/share/bcc/tools/stackcount -K t:block:block_rq_insert
If you are using python3, the following failure may be triggered:
"cannot use a bytes pattern on a string-like object"
Then apply the following fix to /usr/lib/python3.7/site-packages/bcc/__init__.py
diff --git a/src/python/bcc/__init__.py b/src/python/bcc/__init__.py
index 6f114de8..bff5f282 100644
--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -769,7 +769,7 @@ class BPF(object):
evt_dir = os.path.join(cat_dir, event)
if os.path.isdir(evt_dir):
tp = ("%s:%s" % (category, event))
- if re.match(tp_re, tp):
+ if re.match(tp_re.decode(), tp):
results.append(tp)
return results
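If applying the diff is inconvenient, the same one-line change can be made
in place with something like the following (path as above; adjust it to your
python version, and treat it as a sketch rather than a tested command):

sed -i 's/re.match(tp_re, tp)/re.match(tp_re.decode(), tp)/' /usr/lib/python3.7/site-packages/bcc/__init__.py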
Thanks,
Ming
>> Trace attached. Produced by: start the trace script
>> (with the pendrive already plugged), wait some seconds, run the test
>> (1 trial, 1 GB), wait for the test to finish, stop the trace.
>>
>> The copy took 73 seconds, roughly as already seen before with the fast
>> old kernel.
>
> This trace shows a good write IO order because the writeback IOs are
> queued to block layer serially from the 'cp' task and writeback wq.
>
> However, writeback IO order is changed in current linus tree because
> the IOs are queued to block layer concurrently from the 'cp' task
> and writeback wq. It might be related with killing queue_congestion
> by blk-mq.
What about using direct I/O to guarantee the ordering? Pity that 'cp'
doesn't seem to have an option for it, but dd should do the trick.
Andrea, can you replace cp with a dd command (on the slow kernel)?
dd if=<path-to-src-file> of=<path-to-copy-on-flash-device> bs=1M oflag=direct
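For example, keeping the structure of the existing test script (mount point,
$testfile and $uuid as used there), a minimal sketch would be:

mount UUID=$uuid /mnt/pendrive
SECONDS=0
# dd needs a destination file name, so reuse the source file's name
dd if=$testfile of=/mnt/pendrive/$(basename $testfile) bs=1M oflag=direct
umount /mnt/pendrive
echo "$SECONDS seconds"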
- Bernd
Il giorno ven, 29/11/2019 alle 10.35 +0800, Ming Lei ha scritto:
> On Fri, Nov 29, 2019 at 08:57:34AM +0800, Ming Lei wrote:
>
> > [...]
>
> > Andrea, can you collect the following log when running the test
> > on current new(bad) kernel?
> >
> > /usr/share/bcc/tools/stackcount -K blk_mq_make_request
>
> Instead, please run the following trace, given insert may be
> called from other paths, such as flush plug:
>
> /usr/share/bcc/tools/stackcount -K t:block:block_rq_insert
Attached, for the new (patched) bad kernel.
Produced by: start the trace script (with the pendrive already
plugged in), wait a few seconds, run the test (1 trial, 1 GB), wait for
the test to finish, stop the trace.
The copy took ~1700 seconds.
Thanks,
Andrea
On 29/11/19 12:44:53, Bernd Schubert wrote:
> >> Trace attached. Produced by: start the trace script
> >> (with the pendrive already plugged), wait some seconds, run the test
> >> (1 trial, 1 GB), wait for the test to finish, stop the trace.
> >>
> >> The copy took 73 seconds, roughly as already seen before with the fast
> >> old kernel.
> >
> > This trace shows a good write IO order because the writeback IOs are
> > queued to block layer serially from the 'cp' task and writeback wq.
> >
> > However, writeback IO order is changed in current linus tree because
> > the IOs are queued to block layer concurrently from the 'cp' task
> > and writeback wq. It might be related with killing queue_congestion
> > by blk-mq.
>
> What about using direct-io to ensure order is guaranteed? Pity that 'cp'
> doesn't seem to have an option for it. But dd should do the trick.
> Andrea, can you replace cp with a dd command (on the slow kernel)?
>
> dd if=<path-to-src-file> of=<path-to-copy-on-flash-device> bs=1M
> oflag=direct
On the "new bad patched" kernel, this command take 68 seconds to complete (mean on 100 trials, with a narrow standard deviation), so perfectly
aligned with the cp command on the old fast good kernel.
Thanks, and bye
Andrea
On Tue, Dec 10, 2019 at 08:35:43AM +0100, Andrea Vai wrote:
> Il giorno mar, 03/12/2019 alle 10.23 +0800, Ming Lei ha scritto:
> > On Fri, Nov 29, 2019 at 03:41:01PM +0100, Andrea Vai wrote:
> > > Il giorno ven, 29/11/2019 alle 10.35 +0800, Ming Lei ha scritto:
> > > > On Fri, Nov 29, 2019 at 08:57:34AM +0800, Ming Lei wrote:
> > > >
> > > > > [...]
> > > >
> > > > > Andrea, can you collect the following log when running the
> > test
> > > > > on current new(bad) kernel?
> > > > >
> > > > > /usr/share/bcc/tools/stackcount -K
> > blk_mq_make_request
> > > >
> > > > Instead, please run the following trace, given insert may be
> > > > called from other paths, such as flush plug:
> > > >
> > > > /usr/share/bcc/tools/stackcount -K t:block:block_rq_insert
> > >
> > > Attached, for new (patched) bad kernel.
> > >
> > > Produced by: start the trace script (with the pendrive already
> > > plugged), wait some seconds, run the test (1 trial, 1 GB), wait
> > for
> > > the test to finish, stop the trace.
> > >
> > > The copy took ~1700 seconds.
> >
> > See the two path[1][2] of inserting request, and path[1] is
> > triggered
> > 4358 times, and the path[2] is triggered 5763 times.
> >
> > The path[2] is expected behaviour. Not sure path [1] is correct,
> > given
> > ext4_release_file() is supposed to be called when this inode is
> > released. That means the file is closed 4358 times during 1GB file
> > copying to usb storage.
> >
> > Cc filesystem list.
> >
> >
> > [1] insert requests when returning to user mode from syscall
> >
> > b'blk_mq_sched_request_inserted'
> > b'blk_mq_sched_request_inserted'
> > b'dd_insert_requests'
> > b'blk_mq_sched_insert_requests'
> > b'blk_mq_flush_plug_list'
> > b'blk_flush_plug_list'
> > b'io_schedule_prepare'
> > b'io_schedule'
> > b'rq_qos_wait'
> > b'wbt_wait'
> > b'__rq_qos_throttle'
> > b'blk_mq_make_request'
> > b'generic_make_request'
> > b'submit_bio'
> > b'ext4_io_submit'
> > b'ext4_writepages'
> > b'do_writepages'
> > b'__filemap_fdatawrite_range'
> > b'ext4_release_file'
> > b'__fput'
> > b'task_work_run'
> > b'exit_to_usermode_loop'
> > b'do_syscall_64'
> > b'entry_SYSCALL_64_after_hwframe'
> > 4358
> >
> > [2] insert requests from writeback wq context
> >
> > b'blk_mq_sched_request_inserted'
> > b'blk_mq_sched_request_inserted'
> > b'dd_insert_requests'
> > b'blk_mq_sched_insert_requests'
> > b'blk_mq_flush_plug_list'
> > b'blk_flush_plug_list'
> > b'io_schedule_prepare'
> > b'io_schedule'
> > b'rq_qos_wait'
> > b'wbt_wait'
> > b'__rq_qos_throttle'
> > b'blk_mq_make_request'
> > b'generic_make_request'
> > b'submit_bio'
> > b'ext4_io_submit'
> > b'ext4_bio_write_page'
> > b'mpage_submit_page'
> > b'mpage_process_page_bufs'
> > b'mpage_prepare_extent_to_map'
> > b'ext4_writepages'
> > b'do_writepages'
> > b'__writeback_single_inode'
> > b'writeback_sb_inodes'
> > b'__writeback_inodes_wb'
> > b'wb_writeback'
> > b'wb_workfn'
> > b'process_one_work'
> > b'worker_thread'
> > b'kthread'
> > b'ret_from_fork'
> > 5763
> >
> > Thanks,
> > Ming
> >
>
> Is there any update on this? Sorry if I am making noise, but I would
> like to help to improve the kernel (or fix it) if I can help.
> Otherwise, please let me know how to consider this case,
IMO, the extra write path from exit_to_usermode_loop() isn't expected;
that should be the reason why the write IO order is changed and performance
drops on your USB storage.
We need our fs/ext4 experts to take a look.
Or can you reproduce the issue on xfs or btrfs?
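For example, assuming the filesystem lives on a partition such as /dev/sdh1
(adjust to the actual device; this wipes the pendrive, so only do it on the
test device):

umount /mnt/pendrive 2>/dev/null
mkfs.xfs -f /dev/sdh1        # or: mkfs.btrfs -f /dev/sdh1
mount /dev/sdh1 /mnt/pendrive   # the UUID changes after mkfs, so mount by device

and then repeat the same cp test.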
Thanks,
Ming
On Tue, Dec 10, 2019 at 04:05:50PM +0800, Ming Lei wrote:
> > > The path[2] is expected behaviour. Not sure path [1] is correct,
> > > given
> > > ext4_release_file() is supposed to be called when this inode is
> > > released. That means the file is closed 4358 times during 1GB file
> > > copying to usb storage.
> > >
> > > [1] insert requests when returning to user mode from syscall
> > >
> > > b'blk_mq_sched_request_inserted'
> > > b'blk_mq_sched_request_inserted'
> > > b'dd_insert_requests'
> > > b'blk_mq_sched_insert_requests'
> > > b'blk_mq_flush_plug_list'
> > > b'blk_flush_plug_list'
> > > b'io_schedule_prepare'
> > > b'io_schedule'
> > > b'rq_qos_wait'
> > > b'wbt_wait'
> > > b'__rq_qos_throttle'
> > > b'blk_mq_make_request'
> > > b'generic_make_request'
> > > b'submit_bio'
> > > b'ext4_io_submit'
> > > b'ext4_writepages'
> > > b'do_writepages'
> > > b'__filemap_fdatawrite_range'
> > > b'ext4_release_file'
> > > b'__fput'
> > > b'task_work_run'
> > > b'exit_to_usermode_loop'
> > > b'do_syscall_64'
> > > b'entry_SYSCALL_64_after_hwframe'
> > > 4358
I'm guessing that your workload is repeatedly truncating a file (or
calling open with O_TRUNC) and then writing data to it. When you do
this, then when the file is closed, we assume that since you were
replacing the previous contents of a file with new contents, that you
would be unhappy if the file contents were replaced by a zero-length
file after a crash. That's because ten years ago, there were a *huge*
number of crappy applications that would replace a file by reading it
into memory, truncating it, and then writing out the new contents of the
file. This could be a high score file for a game, or a KDE or GNOME
state file, etc.
So if someone does open, truncate, write, close, we still immediately
write out the data on the close, assuming that the programmer really
wanted open, truncate, write, fsync, close, but was too careless to
actually do the right thing.
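As a made-up shell illustration (not taken from Andrea's test) of the
pattern being described here:

# rewriting an existing file in place: ">" opens it with O_TRUNC, writes,
# and closes without fsync
echo "new high score" > scores.txt
# on ext4, closing that truncated-and-rewritten file triggers the immediate
# writeout described above (the __filemap_fdatawrite_range in the stack trace)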
A workaround[1] like this is done by all of the major file systems,
and was fallout from the agreement reached in the "O_PONIES"[2] controversy.
This was discussed and agreed to at the 2009 LSF/MM workshop[3]. (See
the "rename, fsync, and ponies" section.)
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
[2] https://blahg.josefsipek.net/?p=364
[3] https://lwn.net/Articles/327601/
So if you're seeing a call to filemap_fdatawrite_range as the result
of a fput, that's why.
In any case, this behavior has been around for a decade, and it
appears to be incidental to your performance difficulties with your
USB thumbdrive and blk-mq.
- Ted
Il giorno lun, 23/12/2019 alle 21.08 +0800, Ming Lei ha scritto:
> On Mon, Dec 23, 2019 at 12:22:45PM +0100, Andrea Vai wrote:
> > Il giorno mer, 18/12/2019 alle 17.48 +0800, Ming Lei ha scritto:
> > > On Wed, Dec 18, 2019 at 09:25:02AM +0100, Andrea Vai wrote:
> > > > Il giorno gio, 12/12/2019 alle 05.33 +0800, Ming Lei ha
> scritto:
> > > > > On Wed, Dec 11, 2019 at 11:07:45AM -0500, Theodore Y. Ts'o
> > > wrote:
> > > > > > On Wed, Dec 11, 2019 at 12:00:58PM +0800, Ming Lei wrote:
> > > > > > > I didn't reproduce the issue in my test environment, and
> > > follows
> > > > > > > Andrea's test commands[1]:
> > > > > > >
> > > > > > > mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > > SECONDS=0
> > > > > > > cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > > umount /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > >
> > > > > > > The 'cp' command supposes to open/close the file just
> once,
> > > > > however
> > > > > > > ext4_release_file() & write pages is observed to run for
> > > 4358
> > > > > times
> > > > > > > when executing the above 'cp' test.
> > > > > >
> > > > > > Why are we sure the ext4_release_file() / _fput() is
> coming
> > > from
> > > > > the
> > > > > > cp command, as opposed to something else that might be
> running
> > > on
> > > > > the
> > > > > > system under test? _fput() is called by the kernel when
> the
> > > last
> > > > >
> > > > > Please see the log:
> > > > >
> > > > >
> > >
> https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming.zip
> > > > >
> > > > > Which is collected by:
> > > > >
> > > > > #!/bin/sh
> > > > > MAJ=$1
> > > > > MIN=$2
> > > > > MAJ=$(( $MAJ << 20 ))
> > > > > DEV=$(( $MAJ | $MIN ))
> > > > >
> > > > > /usr/share/bcc/tools/trace -t -C \
> > > > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d
> %d",
> > > args-
> > > > > >rwbs, args->sector, args->nr_sector' \
> > > > > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d
> %d",
> > > args-
> > > > > >rwbs, args->sector, args->nr_sector'
> > > > >
> > > > > $MAJ:$MIN points to the USB storage disk.
> > > > >
> > > > > From the above IO trace, there are two write paths, one is
> from
> > > cp,
> > > > > another is from writeback wq.
> > > > >
> > > > > The stackcount trace[1] is consistent with the IO trace log
> > > since it
> > > > > only shows two IO paths, that is why I concluded that the
> write
> > > done
> > > > > via
> > > > > ext4_release_file() is from 'cp'.
> > > > >
> > > > > [1]
> > > > >
> > >
> https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming_20191129_150609.zip
> > > > >
> > > > > > reference to a struct file is released. (Specifically, if
> you
> > > > > have a
> > > > > > fd which is dup'ed, it's only when the last fd
> corresponding
> > > to
> > > > > the
> > > > > > struct file is closed, and the struct file is about to be
> > > > > released,
> > > > > > does the file system's f_ops->release function get
> called.)
> > > > > >
> > > > > > So the first question I'd ask is whether there is anything
> > > else
> > > > > going
> > > > > > on the system, and whether the writes are happening to the
> USB
> > > > > thumb
> > > > > > drive, or to some other storage device. And if there is
> > > something
> > > > > > else which is writing to the pendrive, maybe that's why no
> one
> > > > > else
> > > > > > has been able to reproduce the OP's complaint....
> > > > >
> > > > > OK, we can ask Andrea to confirm that via the following
> trace,
> > > which
> > > > > will add pid/comm info in the stack trace:
> > > > >
> > > > > /usr/share/bcc/tools/stackcount
> blk_mq_sched_request_inserted
> > > > >
> > > > > Andrew, could you collect the above log again when running
> > > new/bad
> > > > > kernel for confirming if the write done by
> ext4_release_file()
> > > is
> > > > > from
> > > > > the 'cp' process?
> > > >
> > > > You can find the stackcount log attached. It has been produced
> by:
> > > >
> > > > - /usr/share/bcc/tools/stackcount
> blk_mq_sched_request_inserted >
> > > trace.log
> > > > - wait some seconds
> > > > - run the test (1 copy trial), wait for the test to finish,
> wait
> > > some seconds
> > > > - stop the trace (ctrl+C)
> > >
> > > Thanks for collecting the log, looks your 'stackcount' doesn't
> > > include
> > > comm/pid info, seems there is difference between your bcc and
> > > my bcc in fedora 30.
> > >
> > > Could you collect above log again via the following command?
> > >
> > > /usr/share/bcc/tools/stackcount -P -K t:block:block_rq_insert
> > >
> > > which will show the comm/pid info.
> >
> > ok, attached (trace_20191219.txt), the test (1 trial) took 3684
> > seconds.
>
> From the above trace:
>
> b'blk_mq_sched_request_inserted'
> b'blk_mq_sched_request_inserted'
> b'dd_insert_requests'
> b'blk_mq_sched_insert_requests'
> b'blk_mq_flush_plug_list'
> b'blk_flush_plug_list'
> b'io_schedule_prepare'
> b'io_schedule'
> b'rq_qos_wait'
> b'wbt_wait'
> b'__rq_qos_throttle'
> b'blk_mq_make_request'
> b'generic_make_request'
> b'submit_bio'
> b'ext4_io_submit'
> b'ext4_writepages'
> b'do_writepages'
> b'__filemap_fdatawrite_range'
> b'ext4_release_file'
> b'__fput'
> b'task_work_run'
> b'exit_to_usermode_loop'
> b'do_syscall_64'
> b'entry_SYSCALL_64_after_hwframe'
> b'cp' [19863]
> 4400
>
> So this write is clearly from 'cp' process, and it should be one
> ext4 fs issue.
>
> Ted, can you take a look at this issue?
>
> >
> > > > I also tried the usual test with btrfs and xfs. Btrfs behavior
> > > looks
> > > > "good". xfs seems sometimes better, sometimes worse, I would
> say.
> > > I
> > > > don't know if it matters, anyway you can also find the results
> of
> > > the
> > > > two tests (100 trials each). Basically, btrfs is always
> between 68
> > > and
> > > > 89 seconds, with a cyclicity (?) with "period=2 trials". xfs
> looks
> > > > almost always very good (63-65s), but sometimes "bad" (>300s).
> > >
> > > If you are interested in digging into this one, the following
> trace
> > > should be helpful:
> > >
> > >
> https://lore.kernel.org/linux-scsi/[email protected]/T/#m5aa008626e07913172ad40e1eb8e5f2ffd560fc6
> > >
> >
> > Attached:
> > - trace_xfs_20191223.txt (7 trials, then aborted while doing the
> 8th),
> > times to complete:
> > 64s
> > 63s
> > 64s
> > 833s
> > 1105s
> > 63s
> > 64s
>
> oops, looks we have to collect io insert trace with the following
> bcc script
> on xfs for confirming if there is similar issue with ext4, could you
> run
> it again on xfs? And only post the trace done in case of slow 'cp'.
>
>
> #!/bin/sh
>
> MAJ=$1
> MIN=$2
> MAJ=$(( $MAJ << 20 ))
> DEV=$(( $MAJ | $MIN ))
>
> /usr/share/bcc/tools/trace -t -C \
> 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> >rwbs, args->sector, args->nr_sector' \
> 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args-
> >rwbs, args->sector, args->nr_sector'
>
>
here it is (1 trial, 313 seconds to finish)
Thanks,
Andrea
On Mon, Dec 23, 2019 at 03:02:35PM +0100, Andrea Vai wrote:
> Il giorno lun, 23/12/2019 alle 21.08 +0800, Ming Lei ha scritto:
> > On Mon, Dec 23, 2019 at 12:22:45PM +0100, Andrea Vai wrote:
> > > Il giorno mer, 18/12/2019 alle 17.48 +0800, Ming Lei ha scritto:
> > > > On Wed, Dec 18, 2019 at 09:25:02AM +0100, Andrea Vai wrote:
> > > > > Il giorno gio, 12/12/2019 alle 05.33 +0800, Ming Lei ha
> > scritto:
> > > > > > On Wed, Dec 11, 2019 at 11:07:45AM -0500, Theodore Y. Ts'o
> > > > wrote:
> > > > > > > On Wed, Dec 11, 2019 at 12:00:58PM +0800, Ming Lei wrote:
> > > > > > > > I didn't reproduce the issue in my test environment, and
> > > > follows
> > > > > > > > Andrea's test commands[1]:
> > > > > > > >
> > > > > > > > mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > > > SECONDS=0
> > > > > > > > cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > > > umount /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > > >
> > > > > > > > The 'cp' command supposes to open/close the file just
> > once,
> > > > > > however
> > > > > > > > ext4_release_file() & write pages is observed to run for
> > > > 4358
> > > > > > times
> > > > > > > > when executing the above 'cp' test.
> > > > > > >
> > > > > > > Why are we sure the ext4_release_file() / _fput() is
> > coming
> > > > from
> > > > > > the
> > > > > > > cp command, as opposed to something else that might be
> > running
> > > > on
> > > > > > the
> > > > > > > system under test? _fput() is called by the kernel when
> > the
> > > > last
> > > > > >
> > > > > > Please see the log:
> > > > > >
> > > > > >
> > > >
> > https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming.zip
> > > > > >
> > > > > > Which is collected by:
> > > > > >
> > > > > > #!/bin/sh
> > > > > > MAJ=$1
> > > > > > MIN=$2
> > > > > > MAJ=$(( $MAJ << 20 ))
> > > > > > DEV=$(( $MAJ | $MIN ))
> > > > > >
> > > > > > /usr/share/bcc/tools/trace -t -C \
> > > > > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d
> > %d",
> > > > args-
> > > > > > >rwbs, args->sector, args->nr_sector' \
> > > > > > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d
> > %d",
> > > > args-
> > > > > > >rwbs, args->sector, args->nr_sector'
> > > > > >
> > > > > > $MAJ:$MIN points to the USB storage disk.
> > > > > >
> > > > > > From the above IO trace, there are two write paths, one is
> > from
> > > > cp,
> > > > > > another is from writeback wq.
> > > > > >
> > > > > > The stackcount trace[1] is consistent with the IO trace log
> > > > since it
> > > > > > only shows two IO paths, that is why I concluded that the
> > write
> > > > done
> > > > > > via
> > > > > > ext4_release_file() is from 'cp'.
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming_20191129_150609.zip
> > > > > >
> > > > > > > reference to a struct file is released. (Specifically, if
> > you
> > > > > > have a
> > > > > > > fd which is dup'ed, it's only when the last fd
> > corresponding
> > > > to
> > > > > > the
> > > > > > > struct file is closed, and the struct file is about to be
> > > > > > released,
> > > > > > > does the file system's f_ops->release function get
> > called.)
> > > > > > >
> > > > > > > So the first question I'd ask is whether there is anything
> > > > else
> > > > > > going
> > > > > > > on the system, and whether the writes are happening to the
> > USB
> > > > > > thumb
> > > > > > > drive, or to some other storage device. And if there is
> > > > something
> > > > > > > else which is writing to the pendrive, maybe that's why no
> > one
> > > > > > else
> > > > > > > has been able to reproduce the OP's complaint....
> > > > > >
> > > > > > OK, we can ask Andrea to confirm that via the following
> > trace,
> > > > which
> > > > > > will add pid/comm info in the stack trace:
> > > > > >
> > > > > > /usr/share/bcc/tools/stackcount
> > blk_mq_sched_request_inserted
> > > > > >
> > > > > > Andrew, could you collect the above log again when running
> > > > new/bad
> > > > > > kernel for confirming if the write done by
> > ext4_release_file()
> > > > is
> > > > > > from
> > > > > > the 'cp' process?
> > > > >
> > > > > You can find the stackcount log attached. It has been produced
> > by:
> > > > >
> > > > > - /usr/share/bcc/tools/stackcount
> > blk_mq_sched_request_inserted >
> > > > trace.log
> > > > > - wait some seconds
> > > > > - run the test (1 copy trial), wait for the test to finish,
> > wait
> > > > some seconds
> > > > > - stop the trace (ctrl+C)
> > > >
> > > > Thanks for collecting the log, looks your 'stackcount' doesn't
> > > > include
> > > > comm/pid info, seems there is difference between your bcc and
> > > > my bcc in fedora 30.
> > > >
> > > > Could you collect above log again via the following command?
> > > >
> > > > /usr/share/bcc/tools/stackcount -P -K t:block:block_rq_insert
> > > >
> > > > which will show the comm/pid info.
> > >
> > > ok, attached (trace_20191219.txt), the test (1 trial) took 3684
> > > seconds.
> >
> > From the above trace:
> >
> > b'blk_mq_sched_request_inserted'
> > b'blk_mq_sched_request_inserted'
> > b'dd_insert_requests'
> > b'blk_mq_sched_insert_requests'
> > b'blk_mq_flush_plug_list'
> > b'blk_flush_plug_list'
> > b'io_schedule_prepare'
> > b'io_schedule'
> > b'rq_qos_wait'
> > b'wbt_wait'
> > b'__rq_qos_throttle'
> > b'blk_mq_make_request'
> > b'generic_make_request'
> > b'submit_bio'
> > b'ext4_io_submit'
> > b'ext4_writepages'
> > b'do_writepages'
> > b'__filemap_fdatawrite_range'
> > b'ext4_release_file'
> > b'__fput'
> > b'task_work_run'
> > b'exit_to_usermode_loop'
> > b'do_syscall_64'
> > b'entry_SYSCALL_64_after_hwframe'
> > b'cp' [19863]
> > 4400
> >
> > So this write is clearly from 'cp' process, and it should be one
> > ext4 fs issue.
> >
> > Ted, can you take a look at this issue?
> >
> > >
> > > > > I also tried the usual test with btrfs and xfs. Btrfs behavior
> > > > looks
> > > > > "good". xfs seems sometimes better, sometimes worse, I would
> > say.
> > > > I
> > > > > don't know if it matters, anyway you can also find the results
> > of
> > > > the
> > > > > two tests (100 trials each). Basically, btrfs is always
> > between 68
> > > > and
> > > > > 89 seconds, with a cyclicity (?) with "period=2 trials". xfs
> > looks
> > > > > almost always very good (63-65s), but sometimes "bad" (>300s).
> > > >
> > > > If you are interested in digging into this one, the following
> > trace
> > > > should be helpful:
> > > >
> > > >
> > https://lore.kernel.org/linux-scsi/[email protected]/T/#m5aa008626e07913172ad40e1eb8e5f2ffd560fc6
> > > >
> > >
> > > Attached:
> > > - trace_xfs_20191223.txt (7 trials, then aborted while doing the
> > 8th),
> > > times to complete:
> > > 64s
> > > 63s
> > > 64s
> > > 833s
> > > 1105s
> > > 63s
> > > 64s
> >
> > oops, looks we have to collect io insert trace with the following
> > bcc script
> > on xfs for confirming if there is similar issue with ext4, could you
> > run
> > it again on xfs? And only post the trace done in case of slow 'cp'.
> >
> >
> > #!/bin/sh
> >
> > MAJ=$1
> > MIN=$2
> > MAJ=$(( $MAJ << 20 ))
> > DEV=$(( $MAJ | $MIN ))
> >
> > /usr/share/bcc/tools/trace -t -C \
> > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args-
> > >rwbs, args->sector, args->nr_sector' \
> > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args-
> > >rwbs, args->sector, args->nr_sector'
> >
> >
> here it is (1 trial, 313 seconds to finish)
The above log shows an issue similar to the ext4 one, since there is another
writeback IO path from the 'cp' process. And the following trace can show
whether it is the same as ext4's issue:
/usr/share/bcc/tools/stackcount -P -K t:block:block_rq_insert
Thanks,
Ming
On Tue, Dec 24, 2019 at 09:27:07AM +0800, Ming Lei wrote:
> The ext4_release_file() should be run from read() or write() syscall if
> Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected behavior
> for ext4_release_file() to be run thousands of times when just
> running 'cp' once, see comment of ext4_release_file():
What's your evidence of that? As opposed to the writeback taking a
long time, leading to the *one* call of ext4_release_file taking a
long time? If it's a big file, we might very well be calling
ext4_writepages multiple times, from a single call to
__filemap_fdatawrite_range().
That assertion confused me mightily, and it caused me to make
assumptions that cp was doing something crazy. But I'm quite convinced
now that this is almost certainly not what is happening.
> > I suspect the next step is use a blktrace, to see what kind of I/O is
> > being sent to the USB drive, and how long it takes for the I/O to
> > complete. You might also try to capture the output of "iostat -x 1"
> > while the script is running, and see what the difference might be
> > between a kernel version that has the problem and one that doesn't,
> > and see if that gives us a clue.
>
> That isn't necessary, given we have concluded that the bad write
> performance is caused by broken write order.
I didn't see any evidence of that from what I had in my inbox, so I
went back to the mailing list archives to figure out what you were
talking about. Part of the problem is this has been a very
long-spanning thread, and I had deleted from my inbox all of the parts
relating to the MQ scheduler since that was clearly Not My Problem. :-)
So, summarizing most of the thread: the problem started when we
removed the legacy I/O scheduler, since we are now only using the MQ
scheduler. What the kernel is sending is long writes (240 sectors),
but they are being sent as an interleaved stream of two sequential
writes. This particular pendrive can't handle this workload, because
it has a very simplistic Flash Translation Layer. Now, this is not
*broken*, from a storage perspective; it's just that it's more than
the simple little brain of this particular pen drive can handle.
Previously, with a single queue, and especially since the queue depth
supported by this pen drive is 1, the elevator algorithm would sort
the I/O requests so that they would be mostly sequential, and this
wouldn't be much of a problem. However, once the legacy I/O stack was
removed, the MQ stack is designed so that we don't have to take a
global lock in order to submit an I/O request. That also means that
we can't do a full elevator sort since that would require locking all
of the queues.
This is not a problem, since HDD's generally have a 16-deep queue, and
SSD's have a super-deep queue depth since they get their speed via
parallel writes to different flash chips. Unfortunately, it *is* a
problem for super primitive USB sticks.
> So far, the reason points to the extra writeback path from exit_to_usermode_loop().
> If it is not from close() syscall, the issue should be related with file reference
> count. If it is from close() syscall, the issue might be in 'cp''s
> implementation.
Oh, it's probably from the close system call; and it's *only* from a
single close system call. Because of the automatic delayed
allocation resolution that protects against buggy userspace, under
certain circumstances, as I explained earlier, we force a full
writeout on a close for a file descriptor which was opened with
O_TRUNC. This is by *design*, since we are trying to protect against
buggy userspace (application programmers vastly outnumber file system
programmers, and far too many of them want O_PONY). This is Working
As Intended.
You can disable it by deleting the test file before the cp:
rm -f /mnt/pendrive/$testfile
Or you can disable the protection against stupid userspace by using
the noauto_da_alloc mount option. (But then if you have a buggy game
program which writes the top-ten score file by using open(2) w/
O_TRUNC, and then said program closes the OpenGL library, and the
proprietary 3rd party binary-only video driver wedges the X server
requiring a hard reset to recover, and the top-ten score file becomes
a zero-length file, don't come crying to me... Or if a graphical text
editor forgets to use fsync(2) before saving a source file you spent
hours working on, and then the system crashes at exactly the wrong
moment and your source file becomes zero-length, again, don't come
crying to me. Blame the stupid application programmer who wrote
your text editor and decided to skip the fsync(2), or who decided that
copying the ACL's and xattrs was Too Hard(tm), and so opening the file
with O_TRUNC and rewriting the file in place was easier for the
application programmer.)
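Concretely, either of these would do it in the test script (this is just a
sketch, using the same $uuid/$testfile variables and mount point as before):

# (a) make sure the destination doesn't already exist, so cp's open() doesn't
#     truncate an existing file and arm the flush-on-close heuristic
rm -f /mnt/pendrive/$testfile
cp $testfile /mnt/pendrive
# (b) or mount without the protection
mount -o noauto_da_alloc UUID=$uuid /mnt/pendrive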
In any case, I think this is all working as intended. The MQ I/O
stack is optimized for modern HDD and SSD's, and especially SSD's.
And the file system assumes that parallel sequential writes,
especially if they are large, are really not a big deal, since that's
what NCQ or the massive parallelism of pretty much all SSD's want.
(Again, ignoring the legacy of crappy flash drives.)
You can argue with the storage stack folks, if you want, about whether we
need a super-dumb mode for slow, crappy flash which uses a global lock and
a global elevator scheduler. I'm going to stay out of that argument.
- Ted
On Thu, Dec 26, 2019 at 10:27:02AM +0800, Ming Lei wrote:
> Maybe we need to be careful for HDD., since the request count in scheduler
> queue is double of in-flight request count, and in theory NCQ should only
> cover all in-flight 32 requests. I will find a sata HDD., and see if
> performance drop can be observed in the similar 'cp' test.
Please try to measure it, but I'd be really surprised if it's
significant with modern HDD's. That's because they typically have
a queue depth of 16, and a max_sectors_kb of 32767 (e.g., just under
32 MiB). Short seeks are typically 1-2 ms, with full-stroke seeks
8-10ms. The typical sequential write speed of a 7200 RPM drive is
125-150 MiB/s. So suppose every other request sent to the HDD is from
the other request stream. The disk will choose the 8 requests from its
queue that are contiguous, and so it will be writing around 256 MiB,
which will take 2-3 seconds. If it then needs to spend between 1 and
10 ms seeking to another location of the disk, before it writes the
next 256 MiB, the worst case overhead of that seek is 10ms / 2s, or
0.5%. That may very well be within your measurements' error bars.
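A quick back-of-the-envelope check of those numbers, using the midpoint of
the 125-150 MiB/s range:

echo "scale=4; 0.010 / (8 * 32 / 137.5)" | bc   # prints ~.0053, i.e. roughly 0.5% overhead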
And of course, note that in real life, we are very *often* writing to
multiple files in parallel, for example, during a "make -j16" while
building the kernel. Writing a single large file is certainly
something people do (but even there people who are burning a 4G DVD
rip are often browsing the web while they are waiting for it to
complete, and the browser will be writing cache files, etc.). So
whether or not we should be stressing over this specific workload is
going to be quite debatable.
- Ted
On Wed, Dec 25, 2019 at 10:30:57PM -0500, Theodore Y. Ts'o wrote:
> On Thu, Dec 26, 2019 at 10:27:02AM +0800, Ming Lei wrote:
> > Maybe we need to be careful for HDD., since the request count in scheduler
> > queue is double of in-flight request count, and in theory NCQ should only
> > cover all in-flight 32 requests. I will find a sata HDD., and see if
> > performance drop can be observed in the similar 'cp' test.
>
> Please try to measure it, but I'd be really surprised if it's
> significant with with modern HDD's.
I just found a machine with AHCI SATA, and ran the following xfs
overwrite test:
#!/bin/bash
DIR=$1
echo 3 > /proc/sys/vm/drop_caches
fio --readwrite=write --filesize=5g --overwrite=1 --filename=$DIR/fiofile \
	--runtime=60s --time_based --ioengine=psync --direct=0 --bs=4k \
	--iodepth=128 --numjobs=2 --group_reporting=1 --name=overwrite
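The script is invoked with the target directory as its only argument, e.g.
(script name and directory are just examples):

bash overwrite.sh /mnt/xfs-test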
The FS is xfs, and the disk is LVM over AHCI SATA with NCQ (depth 32); the
machine was picked up from RH Beaker, and it is the only disk in the box.
#lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931.5G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 930.5G 0 part
├─rhel_hpe--ml10gen9--01-root 253:0 0 50G 0 lvm /
├─rhel_hpe--ml10gen9--01-swap 253:1 0 3.9G 0 lvm [SWAP]
└─rhel_hpe--ml10gen9--01-home 253:2 0 876.6G 0 lvm /home
Kernel: 3a7ea2c483a53fc ("scsi: provide mq_ops->busy() hook"), which is
the commit just before f664a3cc17b7 ("scsi: kill off the legacy IO path").
            | scsi_mod.use_blk_mq=N | scsi_mod.use_blk_mq=Y |
------------+-----------------------+-----------------------+
 throughput | 244MB/s               | 169MB/s               |
------------+-----------------------+-----------------------+
A similar result (184MB/s) can be observed on the v5.4 kernel with the same
test steps.
> That because they typically have
> a queue depth of 16, and a max_sectors_kb of 32767 (e.g., just under
> 32 MiB). Sort seeks are typically 1-2 ms, with full stroke seeks
> 8-10ms. Typical sequential write speeds on a 7200 RPM drive is
> 125-150 MiB/s. So suppose every other request sent to the HDD is from
> the other request stream. The disk will chose the 8 requests from its
> queue that are contiguous, and so it will be writing around 256 MiB,
> which will take 2-3 seconds. If it then needs to spend between 1 and
> 10 ms seeking to another location of the disk, before it writes the
> next 256 MiB, the worst case overhead of that seek is 10ms / 2s, or
> 0.5%. That may very well be within your measurements' error bars.
It looks like you assume that disk seeking happens just once when writing
around 256MB. This assumption may not hold, given that all the data can be
in the page cache before writing. So when two tasks are submitting IOs
concurrently, the IOs from each single task are sequential, and NCQ may
order the current batch submitted from the two streams. However, disk
seeking may still be needed for the next batch handled by NCQ.
> And of course, note that in real life, we are very *often* writing to
> multiple files in parallel, for example, during a "make -j16" while
> building the kernel. Writing a single large file is certainly
> something people do (but even there people who are burning a 4G DVD
> rip are often browsing the web while they are waiting for it to
> complete, and the browser will be writing cache files, etc.). So
> whether or not this is something where we should be stressing over
> this specific workload is going to be quite debateable.
Thanks,
Ming