2019-12-23 13:09:39

by Ming Lei

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

On Mon, Dec 23, 2019 at 12:22:45PM +0100, Andrea Vai wrote:
> Il giorno mer, 18/12/2019 alle 17.48 +0800, Ming Lei ha scritto:
> > On Wed, Dec 18, 2019 at 09:25:02AM +0100, Andrea Vai wrote:
> > > Il giorno gio, 12/12/2019 alle 05.33 +0800, Ming Lei ha scritto:
> > > > On Wed, Dec 11, 2019 at 11:07:45AM -0500, Theodore Y. Ts'o
> > wrote:
> > > > > On Wed, Dec 11, 2019 at 12:00:58PM +0800, Ming Lei wrote:
> > > > > > I didn't reproduce the issue in my test environment, and
> > follows
> > > > > > Andrea's test commands[1]:
> > > > > >
> > > > > > mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > SECONDS=0
> > > > > > cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > umount /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > >
> > > > > > The 'cp' command supposes to open/close the file just once,
> > > > however
> > > > > > ext4_release_file() & write pages is observed to run for
> > 4358
> > > > times
> > > > > > when executing the above 'cp' test.
> > > > >
> > > > > Why are we sure the ext4_release_file() / _fput() is coming
> > from
> > > > the
> > > > > cp command, as opposed to something else that might be running
> > on
> > > > the
> > > > > system under test? _fput() is called by the kernel when the
> > last
> > > >
> > > > Please see the log:
> > > >
> > > >
> > https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming.zip
> > > >
> > > > Which is collected by:
> > > >
> > > > #!/bin/sh
> > > > MAJ=$1
> > > > MIN=$2
> > > > MAJ=$(( $MAJ << 20 ))
> > > > DEV=$(( $MAJ | $MIN ))
> > > >
> > > > /usr/share/bcc/tools/trace -t -C \
> > > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d",
> > args-
> > > > >rwbs, args->sector, args->nr_sector' \
> > > > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d",
> > args-
> > > > >rwbs, args->sector, args->nr_sector'
> > > >
> > > > $MAJ:$MIN points to the USB storage disk.
> > > >
> > > > From the above IO trace, there are two write paths, one is from
> > cp,
> > > > another is from writeback wq.
> > > >
> > > > The stackcount trace[1] is consistent with the IO trace log
> > since it
> > > > only shows two IO paths, that is why I concluded that the write
> > done
> > > > via
> > > > ext4_release_file() is from 'cp'.
> > > >
> > > > [1]
> > > >
> > https://lore.kernel.org/linux-scsi/[email protected]/2-log_ming_20191129_150609.zip
> > > >
> > > > > reference to a struct file is released. (Specifically, if you
> > > > have a
> > > > > fd which is dup'ed, it's only when the last fd corresponding
> > to
> > > > the
> > > > > struct file is closed, and the struct file is about to be
> > > > released,
> > > > > does the file system's f_ops->release function get called.)
> > > > >
> > > > > So the first question I'd ask is whether there is anything
> > else
> > > > going
> > > > > on the system, and whether the writes are happening to the USB
> > > > thumb
> > > > > drive, or to some other storage device. And if there is
> > something
> > > > > else which is writing to the pendrive, maybe that's why no one
> > > > else
> > > > > has been able to reproduce the OP's complaint....
> > > >
> > > > OK, we can ask Andrea to confirm that via the following trace,
> > which
> > > > will add pid/comm info in the stack trace:
> > > >
> > > > /usr/share/bcc/tools/stackcount blk_mq_sched_request_inserted
> > > >
> > > > Andrew, could you collect the above log again when running
> > new/bad
> > > > kernel for confirming if the write done by ext4_release_file()
> > is
> > > > from
> > > > the 'cp' process?
> > >
> > > You can find the stackcount log attached. It has been produced by:
> > >
> > > - /usr/share/bcc/tools/stackcount blk_mq_sched_request_inserted >
> > trace.log
> > > - wait some seconds
> > > - run the test (1 copy trial), wait for the test to finish, wait
> > some seconds
> > > - stop the trace (ctrl+C)
> >
> > Thanks for collecting the log, looks your 'stackcount' doesn't
> > include
> > comm/pid info, seems there is difference between your bcc and
> > my bcc in fedora 30.
> >
> > Could you collect above log again via the following command?
> >
> > /usr/share/bcc/tools/stackcount -P -K t:block:block_rq_insert
> >
> > which will show the comm/pid info.
>
> ok, attached (trace_20191219.txt), the test (1 trial) took 3684
> seconds.

From the above trace:

b'blk_mq_sched_request_inserted'
b'blk_mq_sched_request_inserted'
b'dd_insert_requests'
b'blk_mq_sched_insert_requests'
b'blk_mq_flush_plug_list'
b'blk_flush_plug_list'
b'io_schedule_prepare'
b'io_schedule'
b'rq_qos_wait'
b'wbt_wait'
b'__rq_qos_throttle'
b'blk_mq_make_request'
b'generic_make_request'
b'submit_bio'
b'ext4_io_submit'
b'ext4_writepages'
b'do_writepages'
b'__filemap_fdatawrite_range'
b'ext4_release_file'
b'__fput'
b'task_work_run'
b'exit_to_usermode_loop'
b'do_syscall_64'
b'entry_SYSCALL_64_after_hwframe'
b'cp' [19863]
4400

So this write is clearly from 'cp' process, and it should be one
ext4 fs issue.

Ted, can you take a look at this issue?

>
> > > I also tried the usual test with btrfs and xfs. Btrfs behavior
> > looks
> > > "good". xfs seems sometimes better, sometimes worse, I would say.
> > I
> > > don't know if it matters, anyway you can also find the results of
> > the
> > > two tests (100 trials each). Basically, btrfs is always between 68
> > and
> > > 89 seconds, with a cyclicity (?) with "period=2 trials". xfs looks
> > > almost always very good (63-65s), but sometimes "bad" (>300s).
> >
> > If you are interested in digging into this one, the following trace
> > should be helpful:
> >
> > https://lore.kernel.org/linux-scsi/[email protected]/T/#m5aa008626e07913172ad40e1eb8e5f2ffd560fc6
> >
>
> Attached:
> - trace_xfs_20191223.txt (7 trials, then aborted while doing the 8th),
> times to complete:
> 64s
> 63s
> 64s
> 833s
> 1105s
> 63s
> 64s

oops, looks we have to collect io insert trace with the following bcc script
on xfs for confirming if there is similar issue with ext4, could you run
it again on xfs? And only post the trace done in case of slow 'cp'.


#!/bin/sh

MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))

/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'


Thanks,
Ming


2019-12-23 16:27:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

On Mon, Dec 23, 2019 at 09:08:28PM +0800, Ming Lei wrote:
>
> From the above trace:
>
> b'blk_mq_sched_request_inserted'
> b'blk_mq_sched_request_inserted'
> b'dd_insert_requests'
> b'blk_mq_sched_insert_requests'
> b'blk_mq_flush_plug_list'
> b'blk_flush_plug_list'
> b'io_schedule_prepare'
> b'io_schedule'
> b'rq_qos_wait'
> b'wbt_wait'
> b'__rq_qos_throttle'
> b'blk_mq_make_request'
> b'generic_make_request'
> b'submit_bio'
> b'ext4_io_submit'
> b'ext4_writepages'
> b'do_writepages'
> b'__filemap_fdatawrite_range'
> b'ext4_release_file'
> b'__fput'
> b'task_work_run'
> b'exit_to_usermode_loop'
> b'do_syscall_64'
> b'entry_SYSCALL_64_after_hwframe'
> b'cp' [19863]
> 4400
>
> So this write is clearly from 'cp' process, and it should be one
> ext4 fs issue.

We need a system call trace of the cp process, to understand what
system call is resulting in fput, (eg., I assume it's close(2) but
let's be sure), and often it's calling that system call.

What cp process is it? Is it from shellutils? Is it from busybox?

- Ted

2019-12-23 16:29:47

by Andrea Vai

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Il giorno lun, 23/12/2019 alle 11.26 -0500, Theodore Y. Ts'o ha
scritto:
> On Mon, Dec 23, 2019 at 09:08:28PM +0800, Ming Lei wrote:
> >
> > From the above trace:
> >
> > b'blk_mq_sched_request_inserted'
> > b'blk_mq_sched_request_inserted'
> > b'dd_insert_requests'
> > b'blk_mq_sched_insert_requests'
> > b'blk_mq_flush_plug_list'
> > b'blk_flush_plug_list'
> > b'io_schedule_prepare'
> > b'io_schedule'
> > b'rq_qos_wait'
> > b'wbt_wait'
> > b'__rq_qos_throttle'
> > b'blk_mq_make_request'
> > b'generic_make_request'
> > b'submit_bio'
> > b'ext4_io_submit'
> > b'ext4_writepages'
> > b'do_writepages'
> > b'__filemap_fdatawrite_range'
> > b'ext4_release_file'
> > b'__fput'
> > b'task_work_run'
> > b'exit_to_usermode_loop'
> > b'do_syscall_64'
> > b'entry_SYSCALL_64_after_hwframe'
> > b'cp' [19863]
> > 4400
> >
> > So this write is clearly from 'cp' process, and it should be one
> > ext4 fs issue.
>
> We need a system call trace of the cp process, to understand what
> system call is resulting in fput, (eg., I assume it's close(2) but
> let's be sure), and often it's calling that system call.
>
> What cp process is it? Is it from shellutils? Is it from busybox?
>
> - Ted

I run the cp command from a bash script, or from a bash shell. I don't
know if this answer your question, otherwise feel free to tell me a
way to find the answer to give you.

Thanks,
Andrea

2019-12-23 17:23:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

On Mon, Dec 23, 2019 at 05:29:27PM +0100, Andrea Vai wrote:
> I run the cp command from a bash script, or from a bash shell. I don't
> know if this answer your question, otherwise feel free to tell me a
> way to find the answer to give you.

What distro are you using, and/or what package is the cp command
coming from, and what is the package name and version?

Also, can you remind me what the bash script is and how many files you are copying?

Can you change the script so that the cp command is prefixed by:

"strace -tTf -o /tmp/st "

e.g.,

strace -tTf -o /tmp/st cp <args>

And then send me the /tmp/st file. This will significantly change the
time, so don't do this for measuring performance. I just want to see
what the /bin/cp command is *doing*.

- Ted

2019-12-23 18:46:30

by Andrea Vai

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Il giorno lun, 23/12/2019 alle 12.22 -0500, Theodore Y. Ts'o ha
scritto:
> On Mon, Dec 23, 2019 at 05:29:27PM +0100, Andrea Vai wrote:
> > I run the cp command from a bash script, or from a bash shell. I
> don't
> > know if this answer your question, otherwise feel free to tell me
> a
> > way to find the answer to give you.
>
> What distro are you using, and/or what package is the cp command
> coming from, and what is the package name and version?

Fedora 30

$ rpm -qf `which cp`
coreutils-8.31-6.fc30.x86_64

>
> Also, can you remind me what the bash script is and how many files
> you are copying?

basically, it's:

mount UUID=$uuid /mnt/pendrive
SECONDS=0
cp $testfile /mnt/pendrive
umount /mnt/pendrive
tempo=$SECONDS

and it copies one file only. Anyway, you can find the whole script
attached.


>
> Can you change the script so that the cp command is prefixed by:
>
> "strace -tTf -o /tmp/st "
>
> e.g.,
>
> strace -tTf -o /tmp/st cp <args>
>
> And then send me
btw, please tell me if "me" means only you or I cc: all the
recipients, as usual

> the /tmp/st file. This will significantly change the
> time, so don't do this for measuring performance. I just want to
> see
> what the /bin/cp command is *doing*.

I will do it, but I have a doubt. Since the problem doesn't happen
every time, is it useful to give you a trace of a "fast" run? And, if
it's not, I think I should measure performance with the trace command
prefix, to identify a "slow" run to report you. Does it make sense?

Thanks,
Andrea


Attachments:
test (1.37 kB)

2019-12-23 19:53:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> basically, it's:
>
> mount UUID=$uuid /mnt/pendrive
> SECONDS=0
> cp $testfile /mnt/pendrive
> umount /mnt/pendrive
> tempo=$SECONDS
>
> and it copies one file only. Anyway, you can find the whole script
> attached.

OK, so whether we are doing the writeback at the end of cp, or when
you do the umount, it's probably not going to make any difference. We
can get rid of the stack trace in question by changing the script to
be basically:

mount UUID=$uuid /mnt/pendrive
SECONDS=0
rm -f /mnt/pendrive/$testfile
cp $testfile /mnt/pendrive
umount /mnt/pendrive
tempo=$SECONDS

I predict if you do that, you'll see that all of the time is spent in
the umount, when we are trying to write back the file.

I really don't think then this is a file system problem at all. It's
just that USB I/O is slow, for whatever reason. We'll see a stack
trace in the writeback code waiting for the I/O to be completed, but
that doesn't mean that the root cause is in the writeback code or in
the file system which is triggering the writeback.

I suspect the next step is use a blktrace, to see what kind of I/O is
being sent to the USB drive, and how long it takes for the I/O to
complete. You might also try to capture the output of "iostat -x 1"
while the script is running, and see what the difference might be
between a kernel version that has the problem and one that doesn't,
and see if that gives us a clue.

> > And then send me
> btw, please tell me if "me" means only you or I cc: all the
> recipients, as usual

Well, I don't think we know what the root cause is. Ming is focusing
on that stack trace, but I think it's a red herring..... And if it's
not a file system problem, then other people will be best suited to
debug the issue.

- Ted

2019-12-24 01:28:02

by Ming Lei

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Hi Ted,

On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > basically, it's:
> >
> > mount UUID=$uuid /mnt/pendrive
> > SECONDS=0
> > cp $testfile /mnt/pendrive
> > umount /mnt/pendrive
> > tempo=$SECONDS
> >
> > and it copies one file only. Anyway, you can find the whole script
> > attached.
>
> OK, so whether we are doing the writeback at the end of cp, or when
> you do the umount, it's probably not going to make any difference. We
> can get rid of the stack trace in question by changing the script to
> be basically:
>
> mount UUID=$uuid /mnt/pendrive
> SECONDS=0
> rm -f /mnt/pendrive/$testfile
> cp $testfile /mnt/pendrive
> umount /mnt/pendrive
> tempo=$SECONDS
>
> I predict if you do that, you'll see that all of the time is spent in
> the umount, when we are trying to write back the file.
>
> I really don't think then this is a file system problem at all. It's
> just that USB I/O is slow, for whatever reason. We'll see a stack
> trace in the writeback code waiting for the I/O to be completed, but
> that doesn't mean that the root cause is in the writeback code or in
> the file system which is triggering the writeback.

Wrt. the slow write on this usb storage, it is caused by two writeback
path, one is the writeback wq, another is from ext4_release_file() which
is triggered from exit_to_usermode_loop().

When the two write path is run concurrently, the sequential write order
is broken, then write performance drops much on this particular usb
storage.

The ext4_release_file() should be run from read() or write() syscall if
Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected behavior
for ext4_release_file() to be run thousands of times when just
running 'cp' once, see comment of ext4_release_file():

/*
* Called when an inode is released. Note that this is different
* from ext4_file_open: open gets called at every open, but release
* gets called only when /all/ the files are closed.
*/
static int ext4_release_file(struct inode *inode, struct file *filp)

>
> I suspect the next step is use a blktrace, to see what kind of I/O is
> being sent to the USB drive, and how long it takes for the I/O to
> complete. You might also try to capture the output of "iostat -x 1"
> while the script is running, and see what the difference might be
> between a kernel version that has the problem and one that doesn't,
> and see if that gives us a clue.

That isn't necessary, given we have concluded that the bad write
performance is caused by broken write order.

>
> > > And then send me
> > btw, please tell me if "me" means only you or I cc: all the
> > recipients, as usual
>
> Well, I don't think we know what the root cause is. Ming is focusing
> on that stack trace, but I think it's a red herring..... And if it's
> not a file system problem, then other people will be best suited to
> debug the issue.

So far, the reason points to the extra writeback path from exit_to_usermode_loop().
If it is not from close() syscall, the issue should be related with file reference
count. If it is from close() syscall, the issue might be in 'cp''s
implementation.

Andrea, please collect the following log or the strace log requested by Ted, then
we can confirm if the extra writeback is from close() or read/write() syscall:

# pass PID of 'cp' to this script
#!/bin/sh
PID=$1
/usr/share/bcc/tools/trace -P $PID -t -C \
't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:syscalls:sys_exit_close ' \
't:syscalls:sys_exit_read ' \
't:syscalls:sys_exit_write '


Thanks,
Ming

2019-12-24 06:50:09

by Andrea Vai

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Il giorno mar, 24/12/2019 alle 09.27 +0800, Ming Lei ha scritto:
> Hi Ted,
>
> On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> > On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > > basically, it's:
> > >
> > > mount UUID=$uuid /mnt/pendrive
> > > SECONDS=0
> > > cp $testfile /mnt/pendrive
> > > umount /mnt/pendrive
> > > tempo=$SECONDS
> > >
> > > and it copies one file only. Anyway, you can find the whole
> script
> > > attached.
> >
> > OK, so whether we are doing the writeback at the end of cp, or
> when
> > you do the umount, it's probably not going to make any
> difference. We
> > can get rid of the stack trace in question by changing the script
> to
> > be basically:
> >
> > mount UUID=$uuid /mnt/pendrive
> > SECONDS=0
> > rm -f /mnt/pendrive/$testfile
> > cp $testfile /mnt/pendrive
> > umount /mnt/pendrive
> > tempo=$SECONDS
> >
> > I predict if you do that, you'll see that all of the time is spent
> in
> > the umount, when we are trying to write back the file.
> >
> > I really don't think then this is a file system problem at
> all. It's
> > just that USB I/O is slow, for whatever reason. We'll see a stack
> > trace in the writeback code waiting for the I/O to be completed,
> but
> > that doesn't mean that the root cause is in the writeback code or
> in
> > the file system which is triggering the writeback.
>
> Wrt. the slow write on this usb storage, it is caused by two
> writeback
> path, one is the writeback wq, another is from ext4_release_file()
> which
> is triggered from exit_to_usermode_loop().
>
> When the two write path is run concurrently, the sequential write
> order
> is broken, then write performance drops much on this particular usb
> storage.
>
> The ext4_release_file() should be run from read() or write() syscall
> if
> Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected
> behavior
> for ext4_release_file() to be run thousands of times when just
> running 'cp' once, see comment of ext4_release_file():
>
> /*
> * Called when an inode is released. Note that this is
> different
> * from ext4_file_open: open gets called at every open, but
> release
> * gets called only when /all/ the files are closed.
> */
> static int ext4_release_file(struct inode *inode, struct file
> *filp)
>
> >
> > I suspect the next step is use a blktrace, to see what kind of I/O
> is
> > being sent to the USB drive, and how long it takes for the I/O to
> > complete. You might also try to capture the output of "iostat -x
> 1"
> > while the script is running, and see what the difference might be
> > between a kernel version that has the problem and one that
> doesn't,
> > and see if that gives us a clue.
>
> That isn't necessary, given we have concluded that the bad write
> performance is caused by broken write order.
>
> >
> > > > And then send me
> > > btw, please tell me if "me" means only you or I cc: all the
> > > recipients, as usual
> >
> > Well, I don't think we know what the root cause is. Ming is
> focusing
> > on that stack trace, but I think it's a red herring..... And if
> it's
> > not a file system problem, then other people will be best suited
> to
> > debug the issue.
>
> So far, the reason points to the extra writeback path from
> exit_to_usermode_loop().
> If it is not from close() syscall, the issue should be related with
> file reference
> count. If it is from close() syscall, the issue might be in 'cp''s
> implementation.
>
> Andrea, please collect the following log or the strace log requested
> by Ted, then
> we can confirm if the extra writeback is from close() or
> read/write() syscall:
>
> # pass PID of 'cp' to this script
> #!/bin/sh
> PID=$1
> /usr/share/bcc/tools/trace -P $PID -t -C \
> 't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector,
> args->nr_sector' \
> 't:syscalls:sys_exit_close ' \
> 't:syscalls:sys_exit_read ' \
> 't:syscalls:sys_exit_write '

Sorry if I am a bit confused, should I run it on ext4 or xfs, or
doesn't matter? What if I get it on a "fast" run? Should I throw it
away and try again until I get a slow one, or it doesn't matter?

Thanks,
Andrea

2019-12-24 08:52:22

by Andrea Vai

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Il giorno mar, 24/12/2019 alle 09.27 +0800, Ming Lei ha scritto:
> Hi Ted,
>
> On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> > On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > > basically, it's:
> > >
> > > mount UUID=$uuid /mnt/pendrive
> > > SECONDS=0
> > > cp $testfile /mnt/pendrive
> > > umount /mnt/pendrive
> > > tempo=$SECONDS
> > >
> > > and it copies one file only. Anyway, you can find the whole
> script
> > > attached.
> >
> > OK, so whether we are doing the writeback at the end of cp, or
> when
> > you do the umount, it's probably not going to make any
> difference. We
> > can get rid of the stack trace in question by changing the script
> to
> > be basically:
> >
> > mount UUID=$uuid /mnt/pendrive
> > SECONDS=0
> > rm -f /mnt/pendrive/$testfile
> > cp $testfile /mnt/pendrive
> > umount /mnt/pendrive
> > tempo=$SECONDS
> >
> > I predict if you do that, you'll see that all of the time is spent
> in
> > the umount, when we are trying to write back the file.
> >
> > I really don't think then this is a file system problem at
> all. It's
> > just that USB I/O is slow, for whatever reason. We'll see a stack
> > trace in the writeback code waiting for the I/O to be completed,
> but
> > that doesn't mean that the root cause is in the writeback code or
> in
> > the file system which is triggering the writeback.
>
> Wrt. the slow write on this usb storage, it is caused by two
> writeback
> path, one is the writeback wq, another is from ext4_release_file()
> which
> is triggered from exit_to_usermode_loop().
>
> When the two write path is run concurrently, the sequential write
> order
> is broken, then write performance drops much on this particular usb
> storage.
>
> The ext4_release_file() should be run from read() or write() syscall
> if
> Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected
> behavior
> for ext4_release_file() to be run thousands of times when just
> running 'cp' once, see comment of ext4_release_file():
>
> /*
> * Called when an inode is released. Note that this is
> different
> * from ext4_file_open: open gets called at every open, but
> release
> * gets called only when /all/ the files are closed.
> */
> static int ext4_release_file(struct inode *inode, struct file
> *filp)
>
> >
> > I suspect the next step is use a blktrace, to see what kind of I/O
> is
> > being sent to the USB drive, and how long it takes for the I/O to
> > complete. You might also try to capture the output of "iostat -x
> 1"
> > while the script is running, and see what the difference might be
> > between a kernel version that has the problem and one that
> doesn't,
> > and see if that gives us a clue.
>
> That isn't necessary, given we have concluded that the bad write
> performance is caused by broken write order.
>
> >
> > > > And then send me
> > > btw, please tell me if "me" means only you or I cc: all the
> > > recipients, as usual
> >
> > Well, I don't think we know what the root cause is. Ming is
> focusing
> > on that stack trace, but I think it's a red herring..... And if
> it's
> > not a file system problem, then other people will be best suited
> to
> > debug the issue.
>
> So far, the reason points to the extra writeback path from
> exit_to_usermode_loop().
> If it is not from close() syscall, the issue should be related with
> file reference
> count. If it is from close() syscall, the issue might be in 'cp''s
> implementation.
>
> Andrea, please collect the following log or the strace log requested
> by Ted, then
> we can confirm if the extra writeback is from close() or
> read/write() syscall:
>
> # pass PID of 'cp' to this script
> #!/bin/sh
> PID=$1
> /usr/share/bcc/tools/trace -P $PID -t -C \
> 't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector,
> args->nr_sector' \
> 't:syscalls:sys_exit_close ' \
> 't:syscalls:sys_exit_read ' \
> 't:syscalls:sys_exit_write '

Meanwhile, I tried to run the test and obtained an error (...usage:
trace [-h] [-b BUFFER_PAGES] [-p PID]...), so assumed the "-P" should
be "-p", corrected and obtained the attached log with ext4 and a slow
copy (2482 seconds) by doing:

- start the test
- look at the cp pid
- run the trace
- wait for the test to finish
- stop the trace.

Thanks,
Andrea


Attachments:
20191224_test_ming.zip (20.34 kB)

2019-12-24 09:36:25

by Ming Lei

[permalink] [raw]
Subject: Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

On Tue, Dec 24, 2019 at 09:51:16AM +0100, Andrea Vai wrote:
> Il giorno mar, 24/12/2019 alle 09.27 +0800, Ming Lei ha scritto:
> > Hi Ted,
> >
> > On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> > > On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > > > basically, it's:
> > > >
> > > > mount UUID=$uuid /mnt/pendrive
> > > > SECONDS=0
> > > > cp $testfile /mnt/pendrive
> > > > umount /mnt/pendrive
> > > > tempo=$SECONDS
> > > >
> > > > and it copies one file only. Anyway, you can find the whole
> > script
> > > > attached.
> > >
> > > OK, so whether we are doing the writeback at the end of cp, or
> > when
> > > you do the umount, it's probably not going to make any
> > difference. We
> > > can get rid of the stack trace in question by changing the script
> > to
> > > be basically:
> > >
> > > mount UUID=$uuid /mnt/pendrive
> > > SECONDS=0
> > > rm -f /mnt/pendrive/$testfile
> > > cp $testfile /mnt/pendrive
> > > umount /mnt/pendrive
> > > tempo=$SECONDS
> > >
> > > I predict if you do that, you'll see that all of the time is spent
> > in
> > > the umount, when we are trying to write back the file.
> > >
> > > I really don't think then this is a file system problem at
> > all. It's
> > > just that USB I/O is slow, for whatever reason. We'll see a stack
> > > trace in the writeback code waiting for the I/O to be completed,
> > but
> > > that doesn't mean that the root cause is in the writeback code or
> > in
> > > the file system which is triggering the writeback.
> >
> > Wrt. the slow write on this usb storage, it is caused by two
> > writeback
> > path, one is the writeback wq, another is from ext4_release_file()
> > which
> > is triggered from exit_to_usermode_loop().
> >
> > When the two write path is run concurrently, the sequential write
> > order
> > is broken, then write performance drops much on this particular usb
> > storage.
> >
> > The ext4_release_file() should be run from read() or write() syscall
> > if
> > Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected
> > behavior
> > for ext4_release_file() to be run thousands of times when just
> > running 'cp' once, see comment of ext4_release_file():
> >
> > /*
> > * Called when an inode is released. Note that this is
> > different
> > * from ext4_file_open: open gets called at every open, but
> > release
> > * gets called only when /all/ the files are closed.
> > */
> > static int ext4_release_file(struct inode *inode, struct file
> > *filp)
> >
> > >
> > > I suspect the next step is use a blktrace, to see what kind of I/O
> > is
> > > being sent to the USB drive, and how long it takes for the I/O to
> > > complete. You might also try to capture the output of "iostat -x
> > 1"
> > > while the script is running, and see what the difference might be
> > > between a kernel version that has the problem and one that
> > doesn't,
> > > and see if that gives us a clue.
> >
> > That isn't necessary, given we have concluded that the bad write
> > performance is caused by broken write order.
> >
> > >
> > > > > And then send me
> > > > btw, please tell me if "me" means only you or I cc: all the
> > > > recipients, as usual
> > >
> > > Well, I don't think we know what the root cause is. Ming is
> > focusing
> > > on that stack trace, but I think it's a red herring..... And if
> > it's
> > > not a file system problem, then other people will be best suited
> > to
> > > debug the issue.
> >
> > So far, the reason points to the extra writeback path from
> > exit_to_usermode_loop().
> > If it is not from close() syscall, the issue should be related with
> > file reference
> > count. If it is from close() syscall, the issue might be in 'cp''s
> > implementation.
> >
> > Andrea, please collect the following log or the strace log requested
> > by Ted, then
> > we can confirm if the extra writeback is from close() or
> > read/write() syscall:
> >
> > # pass PID of 'cp' to this script
> > #!/bin/sh
> > PID=$1
> > /usr/share/bcc/tools/trace -P $PID -t -C \
> > 't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector,
> > args->nr_sector' \
> > 't:syscalls:sys_exit_close ' \
> > 't:syscalls:sys_exit_read ' \
> > 't:syscalls:sys_exit_write '
>
> Meanwhile, I tried to run the test and obtained an error (...usage:
> trace [-h] [-b BUFFER_PAGES] [-p PID]...), so assumed the "-P" should
> be "-p", corrected and obtained the attached log with ext4 and a slow
> copy (2482 seconds) by doing:
>
> - start the test
> - look at the cp pid
> - run the trace
> - wait for the test to finish
> - stop the trace.

The log shows all io submission is from close() syscall, so fs code
is fine, and I have provided the reason of this issue in last email:

https://lore.kernel.org/linux-scsi/[email protected]/T/#m845caca2969da5676516c35dc0c3528a79beb886

Thanks,
Ming