2019-08-03 14:16:30

by Jinpu Wang

Subject: Kernel 4.14 + has 100 times higher IO latency than Kernel 4.4 with raid1

Hi All,

We found a problem regarding much higher IO latency when running
kernel 4.14.133 compared to 4.4.131; we also tried the latest upstream
5.3-rc2, with the same result.

Reproducer:
1 create md raid1 with 2 ram disks:
sudo mdadm -C /dev/md0 -l1 -n2 -e1.2 --bitmap=internal /dev/ram[0-1]
2 run fio command with rate_iops:
fio --rw=write --ioengine=libaio --iodepth=32 --size=1000MB
--rate_iops=5000 --direct=1 --numjobs=1 --runtime=20 --time_based
--name=write-test --filename=/dev/md0

result on 4.4 kernel:
root@ib2:~# fio --rw=write --ioengine=libaio --iodepth=32
--size=1000MB --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20
--time_based --name=write-test --filename=/dev/md0
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1), CR=5000/0 IOPS: [W(1)] [100.0% done] [0KB/20008KB/0KB
/s] [0/5002/0 iops] [eta 00m:00s]
write-test: (groupid=0, jobs=1): err= 0: pid=3351: Fri Aug 2 15:31:26 2019
write: io=400004KB, bw=19999KB/s, iops=4999, runt= 20001msec
slat (usec): min=3, max=26, avg= 3.12, stdev= 0.36
clat (usec): min=0, max=116, avg= 2.04, stdev= 1.33
lat (usec): min=3, max=141, avg= 5.19, stdev= 1.39
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 3],
| 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 3], 99.95th=[ 3],
| 99.99th=[ 86]
bw (KB /s): min=20000, max=20008, per=100.00%, avg=20005.54, stdev= 3.74
lat (usec) : 2=3.37%, 4=96.60%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%
cpu : usr=4.25%, sys=0.00%, ctx=198550, majf=0, minf=11
IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=100001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=400004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=20001msec, maxt=20001msec

Disk stats (read/write):
md0: ios=61/99539, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%

result on 5.3 kernel
root@ib1:/home/jwang# fio --rw=write --ioengine=libaio --iodepth=32
--size=1000MB --rate_iops=5 --direct=1 --numjobs=1 --runtime=20
--time_based --name=write-test --filename=/dev/md0
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1), CR=5/0 IOPS: [W(1)] [100.0% done] [0KB/20KB/0KB /s]
[0/5/0 iops] [eta 00m:00s]
write-test: (groupid=0, jobs=1): err= 0: pid=1651: Fri Aug 2 17:16:18 2019
write: io=413696B, bw=20683B/s, iops=5, runt= 20001msec
slat (usec): min=2, max=51803, avg=1028.62, stdev=7250.96
clat (usec): min=0, max=91, avg=17.76, stdev=28.07
lat (usec): min=3, max=51892, avg=1046.50, stdev=7254.89
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 19], 80.00th=[ 44], 90.00th=[ 68], 95.00th=[ 80],
| 99.00th=[ 88], 99.50th=[ 91], 99.90th=[ 91], 99.95th=[ 91],
| 99.99th=[ 91]
bw (KB /s): min= 20, max= 21, per=100.00%, avg=20.04, stdev= 0.21
lat (usec) : 2=67.33%, 10=0.99%, 20=1.98%, 50=11.88%, 100=17.82%
cpu : usr=0.00%, sys=0.00%, ctx=77, majf=0, minf=10
IO depths : 1=68.3%, 2=2.0%, 4=4.0%, 8=7.9%, 16=15.8%, 32=2.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=101/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=404KB, aggrb=20KB/s, minb=20KB/s, maxb=20KB/s,
mint=20001msec, maxt=20001msec

Disk stats (read/write):
md0: ios=0/100, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

During the tests the following kernel parameters are applied:
processor.max_cstate=0 idle=poll mitigations=off
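They are set on the kernel command line; to double-check them on the
test machines one can run e.g.:

cat /proc/cmdline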

Could anyone give us a hint as to what could lead to such a huge difference?

Thank you!
Jack Wang


2019-08-04 02:41:13

by Paul Menzel

Subject: Re: Kernel 4.14 + has 100 times higher IO latency than Kernel 4.4 with raid1

Dear Jinpu,


On 02.08.19 16:48, Jinpu Wang wrote:

> We found a problem regarding much higher IO latency when running
> kernel 4.14.133 compared to 4.4.131; we also tried the latest upstream
> 5.3-rc2, with the same result.
>
> Reproducer:
> 1 create md raid1 with 2 ram disks:
> sudo mdadm -C /dev/md0 -l1 -n2 -e1.2 --bitmap=internal /dev/ram[0-1]
> 2 run fio command with rate_iops:
> fio --rw=write --ioengine=libaio --iodepth=32 --size=1000MB
> --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20 --time_based
> --name=write-test --filename=/dev/md0
>
> result on 4.4 kernel:
> root@ib2:~# fio --rw=write --ioengine=libaio --iodepth=32
> --size=1000MB --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20
> --time_based --name=write-test --filename=/dev/md0
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
> Jobs: 1 (f=1), CR=5000/0 IOPS: [W(1)] [100.0% done] [0KB/20008KB/0KB
> /s] [0/5002/0 iops] [eta 00m:00s]
> write-test: (groupid=0, jobs=1): err= 0: pid=3351: Fri Aug 2 15:31:26 2019
> write: io=400004KB, bw=19999KB/s, iops=4999, runt= 20001msec
> slat (usec): min=3, max=26, avg= 3.12, stdev= 0.36
> clat (usec): min=0, max=116, avg= 2.04, stdev= 1.33
> lat (usec): min=3, max=141, avg= 5.19, stdev= 1.39
> clat percentiles (usec):
> | 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
> | 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 3],
> | 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 3], 99.95th=[ 3],
> | 99.99th=[ 86]
> bw (KB /s): min=20000, max=20008, per=100.00%, avg=20005.54, stdev= 3.74
> lat (usec) : 2=3.37%, 4=96.60%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.01%, 250=0.01%
> cpu : usr=4.25%, sys=0.00%, ctx=198550, majf=0, minf=11
> IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=100001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=400004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> mint=20001msec, maxt=20001msec
>
> Disk stats (read/write):
> md0: ios=61/99539, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
>
> result on 5.3 kernel
> root@ib1:/home/jwang# fio --rw=write --ioengine=libaio --iodepth=32
> --size=1000MB --rate_iops=5 --direct=1 --numjobs=1 --runtime=20
> --time_based --name=write-test --filename=/dev/md0
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
> Jobs: 1 (f=1), CR=5/0 IOPS: [W(1)] [100.0% done] [0KB/20KB/0KB /s]
> [0/5/0 iops] [eta 00m:00s]
> write-test: (groupid=0, jobs=1): err= 0: pid=1651: Fri Aug 2 17:16:18 2019
> write: io=413696B, bw=20683B/s, iops=5, runt= 20001msec
> slat (usec): min=2, max=51803, avg=1028.62, stdev=7250.96
> clat (usec): min=0, max=91, avg=17.76, stdev=28.07
> lat (usec): min=3, max=51892, avg=1046.50, stdev=7254.89
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
> | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
> | 70.00th=[ 19], 80.00th=[ 44], 90.00th=[ 68], 95.00th=[ 80],
> | 99.00th=[ 88], 99.50th=[ 91], 99.90th=[ 91], 99.95th=[ 91],
> | 99.99th=[ 91]
> bw (KB /s): min= 20, max= 21, per=100.00%, avg=20.04, stdev= 0.21
> lat (usec) : 2=67.33%, 10=0.99%, 20=1.98%, 50=11.88%, 100=17.82%
> cpu : usr=0.00%, sys=0.00%, ctx=77, majf=0, minf=10
> IO depths : 1=68.3%, 2=2.0%, 4=4.0%, 8=7.9%, 16=15.8%, 32=2.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=101/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=404KB, aggrb=20KB/s, minb=20KB/s, maxb=20KB/s,
> mint=20001msec, maxt=20001msec
>
> Disk stats (read/write):
> md0: ios=0/100, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> During the tests the following kernel parameters are applied:
> processor.max_cstate=0 idle=poll mitigations=off
>
> Could anyone give us a hint, what could lead to such a huge difference?

Sorry, I have no idea, but as you can easily reproduce it with RAM
disks, you should be able to bisect the commit causing this quite easily.

Maybe your test should also be added to some test suite (kselftest)?


Kind regards,

Paul

2019-08-05 08:35:12

by Jinpu Wang

Subject: Re: Kernel 4.14 + has 100 times higher IO latency than Kernel 4.4 with raid1

On Fri, Aug 2, 2019 at 9:52 PM Paul Menzel <[email protected]> wrote:
>
> Dear Jinpu,
>
>
> On 02.08.19 16:48, Jinpu Wang wrote:
>
> > We found a problem regarding much higher IO latency when running
> > kernel 4.14.133 compared to 4.4.131; we also tried the latest upstream
> > 5.3-rc2, with the same result.
> >
> > Reproducer:
> > 1 create md raid1 with 2 ram disks:
> > sudo mdadm -C /dev/md0 -l1 -n2 -e1.2 --bitmap=internal /dev/ram[0-1]
> > 2 run fio command with rate_iops:
> > fio --rw=write --ioengine=libaio --iodepth=32 --size=1000MB
> > --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20 --time_based
> > --name=write-test --filename=/dev/md0
> >
> > result on 4.4 kernel:
> > root@ib2:~# fio --rw=write --ioengine=libaio --iodepth=32
> > --size=1000MB --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20
> > --time_based --name=write-test --filename=/dev/md0
> > write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > fio-2.2.10
> > Starting 1 process
> > Jobs: 1 (f=1), CR=5000/0 IOPS: [W(1)] [100.0% done] [0KB/20008KB/0KB
> > /s] [0/5002/0 iops] [eta 00m:00s]
> > write-test: (groupid=0, jobs=1): err= 0: pid=3351: Fri Aug 2 15:31:26 2019
> > write: io=400004KB, bw=19999KB/s, iops=4999, runt= 20001msec
> > slat (usec): min=3, max=26, avg= 3.12, stdev= 0.36
> > clat (usec): min=0, max=116, avg= 2.04, stdev= 1.33
> > lat (usec): min=3, max=141, avg= 5.19, stdev= 1.39
> > clat percentiles (usec):
> > | 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
> > | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
> > | 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 3],
> > | 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 3], 99.95th=[ 3],
> > | 99.99th=[ 86]
> > bw (KB /s): min=20000, max=20008, per=100.00%, avg=20005.54, stdev= 3.74
> > lat (usec) : 2=3.37%, 4=96.60%, 10=0.01%, 20=0.01%, 50=0.01%
> > lat (usec) : 100=0.01%, 250=0.01%
> > cpu : usr=4.25%, sys=0.00%, ctx=198550, majf=0, minf=11
> > IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=0/w=100001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=32
> >
> > Run status group 0 (all jobs):
> > WRITE: io=400004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> > mint=20001msec, maxt=20001msec
> >
> > Disk stats (read/write):
> > md0: ios=61/99539, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> > aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> > aggrutil=0.00%
> >
> > result on 5.3 kernel
> > root@ib1:/home/jwang# fio --rw=write --ioengine=libaio --iodepth=32
> > --size=1000MB --rate_iops=5 --direct=1 --numjobs=1 --runtime=20
> > --time_based --name=write-test --filename=/dev/md0
> > write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > fio-2.2.10
> > Starting 1 process
> > Jobs: 1 (f=1), CR=5/0 IOPS: [W(1)] [100.0% done] [0KB/20KB/0KB /s]
> > [0/5/0 iops] [eta 00m:00s]
> > write-test: (groupid=0, jobs=1): err= 0: pid=1651: Fri Aug 2 17:16:18 2019
> > write: io=413696B, bw=20683B/s, iops=5, runt= 20001msec
> > slat (usec): min=2, max=51803, avg=1028.62, stdev=7250.96
> > clat (usec): min=0, max=91, avg=17.76, stdev=28.07
> > lat (usec): min=3, max=51892, avg=1046.50, stdev=7254.89
> > clat percentiles (usec):
> > | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
> > | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
> > | 70.00th=[ 19], 80.00th=[ 44], 90.00th=[ 68], 95.00th=[ 80],
> > | 99.00th=[ 88], 99.50th=[ 91], 99.90th=[ 91], 99.95th=[ 91],
> > | 99.99th=[ 91]
> > bw (KB /s): min= 20, max= 21, per=100.00%, avg=20.04, stdev= 0.21
> > lat (usec) : 2=67.33%, 10=0.99%, 20=1.98%, 50=11.88%, 100=17.82%
> > cpu : usr=0.00%, sys=0.00%, ctx=77, majf=0, minf=10
> > IO depths : 1=68.3%, 2=2.0%, 4=4.0%, 8=7.9%, 16=15.8%, 32=2.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued : total=r=0/w=101/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> > latency : target=0, window=0, percentile=100.00%, depth=32
> >
> > Run status group 0 (all jobs):
> > WRITE: io=404KB, aggrb=20KB/s, minb=20KB/s, maxb=20KB/s,
> > mint=20001msec, maxt=20001msec
> >
> > Disk stats (read/write):
> > md0: ios=0/100, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> > aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> > aggrutil=0.00%
> > ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> > ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> >
> > During the tests the following kernel parameters are applied:
> > processor.max_cstate=0 idle=poll mitigations=off
> >
> > Could anyone give us a hint, what could lead to such a huge difference?
>
> Sorry, I have no idea, but as you can easily reproduce it with RAM
> disks, bisecting the commit causing this quite easily.
>
> Your tests should also be added to some test suite (kselftest)?
>
>
> Kind regards,
>
> Paul

Thanks Paul,

I'm bisecting now, will report back the result.

Note: the result for 5.3 was run with rate_iops=5 by mistake, but with
5000 the result doesn't change.

2019-08-05 11:56:47

by Jinpu Wang

Subject: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

Hi Neil,

For the md higher write IO latency problem, I bisected it to these commits:

4ad23a97 MD: use per-cpu counter for writes_pending
210f7cd percpu-refcount: support synchronous switch to atomic mode.

Do you maybe have an idea? How can we fix it?

Regards,
Jack Wang

git bisect log:
git bisect start
# good: [d9560919689d588beccf719452086b5cdf6d6c22] Linux 4.4.157
git bisect good d9560919689d588beccf719452086b5cdf6d6c22
# good: [d9560919689d588beccf719452086b5cdf6d6c22] Linux 4.4.157
git bisect good d9560919689d588beccf719452086b5cdf6d6c22
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
git bisect good b562e44f507e863c6792946e4e1b1449fbbac85d
# bad: [7d80e1218adf6d1aa5270587192789e218fef706] Linux 4.14.136
git bisect bad 7d80e1218adf6d1aa5270587192789e218fef706
# good: [309c675e62d140d46806ca3e6b29f4549076d0d6] hwmon: (pcf8591)
use permission-specific DEVICE_ATTR variants
git bisect good 309c675e62d140d46806ca3e6b29f4549076d0d6
# bad: [8ad06e56dcbc1984ef0ff8f6e3c19982c5809f73] Merge branch 'linus'
of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad 8ad06e56dcbc1984ef0ff8f6e3c19982c5809f73
# good: [47d272f0f9887343f4e4d31bb22910b141b96654] Merge tag
'iwlwifi-next-for-kalle-2017-04-26' of
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next
git bisect good 47d272f0f9887343f4e4d31bb22910b141b96654
# bad: [81be3dee96346fbe08c31be5ef74f03f6b63cf68] fs/xattr.c: zero out
memory copied to userspace in getxattr
git bisect bad 81be3dee96346fbe08c31be5ef74f03f6b63cf68
# bad: [2f34c1231bfc9f2550f934acb268ac7315fb3837] Merge tag
'drm-for-v4.12' of git://people.freedesktop.org/~airlied/linux
git bisect bad 2f34c1231bfc9f2550f934acb268ac7315fb3837
# good: [be580e7522eecfcf31c70abdf6fa0ae77b2e293b] Merge tag
'mmc-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
git bisect good be580e7522eecfcf31c70abdf6fa0ae77b2e293b
# good: [f7effef8d6e38d7d3120c604ad7d0b299b349e14] drm/amdgpu: limit
block size to one page
git bisect good f7effef8d6e38d7d3120c604ad7d0b299b349e14
# good: [1676a2b35cd5a548da17d1106fb0d4d238c0d191] drm/i915: Park the
signaler before sleeping
git bisect good 1676a2b35cd5a548da17d1106fb0d4d238c0d191
# good: [0302e28dee643932ee7b3c112ebccdbb9f8ec32c] Merge branch 'next'
of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good 0302e28dee643932ee7b3c112ebccdbb9f8ec32c
# bad: [d35a878ae1c50977b55e352fd46e36e35add72a0] Merge tag
'for-4.12/dm-changes' of
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
git bisect bad d35a878ae1c50977b55e352fd46e36e35add72a0
# bad: [78b6350dcaadb03b4a2970b16387227ba6744876] md: support
disabling of create-on-open semantics.
git bisect bad 78b6350dcaadb03b4a2970b16387227ba6744876
# bad: [513e2faa0138462ce014e1b0e226ca45c83bc6c1] md: prepare for
managing resync I/O pages in clean way
git bisect bad 513e2faa0138462ce014e1b0e226ca45c83bc6c1
# good: [497280509f32340d90feac030bce18006a3e3605] md/raid5: use
md_write_start to count stripes, not bios
git bisect good 497280509f32340d90feac030bce18006a3e3605
# good: [84dd97a69092cef858483b775f1900d743d796a4] md/raid5: don't
test ->writes_pending in raid5_remove_disk
git bisect good 84dd97a69092cef858483b775f1900d743d796a4
# bad: [0b408baf7f4f3ea94239d021a1f19e60cd8694de] raid5-ppl: silence a
misleading warning message
git bisect bad 0b408baf7f4f3ea94239d021a1f19e60cd8694de
# good: [55cc39f345256af241deb6152ff5c06bedd10f11] md: close a race
with setting mddev->in_sync
git bisect good 55cc39f345256af241deb6152ff5c06bedd10f11
# bad: [4ad23a976413aa57fe5ba7a25953dc35ccca5b71] MD: use per-cpu
counter for writes_pending
git bisect bad 4ad23a976413aa57fe5ba7a25953dc35ccca5b71

diff between bad latency and good latency:
root@ib2:/home/jwang/pb-linux-4.14# diff
md_lat_ib2_4.11.0-rc2-1-storage+_2019_0805_125846.log
md_lat_ib2_4.11.0-rc2-1-storage+_2019_0805_132042.log
5c5
< write-test: (groupid=0, jobs=1): err= 0: pid=3265: Mon Aug 5 12:59:27 2019
---
> write-test: (groupid=0, jobs=1): err= 0: pid=3192: Mon Aug 5 13:21:23 2019
7,9c7,9
< slat (usec): min=2, max=29, avg= 2.36, stdev= 0.49
< clat (usec): min=0, max=94, avg= 0.37, stdev= 1.13
< lat (usec): min=2, max=113, avg= 2.74, stdev= 1.16
---
> slat (usec): min=2, max=59973, avg= 5.25, stdev=375.04
> clat (usec): min=0, max=127, avg= 1.74, stdev=13.00
> lat (usec): min=2, max=60095, avg= 7.00, stdev=376.12
13,20c13,20
< | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
< | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1],
< | 99.99th=[ 67]
< bw (KB /s): min=20000, max=20008, per=100.00%, avg=20004.35, stdev= 4.01
< lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
< lat (usec) : 100=0.02%
< cpu : usr=2.44%, sys=1.49%, ctx=199942, majf=0, minf=11
< IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
---
> | 70.00th=[ 0], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
> | 99.00th=[ 93], 99.50th=[ 120], 99.90th=[ 123], 99.95th=[ 124],
> | 99.99th=[ 125]
> bw (KB /s): min=18042, max=22072, per=100.00%, avg=20007.54, stdev=409.26
> lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%
> lat (usec) : 100=0.41%, 250=0.98%
> cpu : usr=1.06%, sys=2.81%, ctx=197106, majf=0, minf=11
> IO depths : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0%
30c30
< md0: ios=60/199469, merge=0/0, ticks=0/0, in_queue=0,
util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
---
> md0: ios=60/199443, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%

2019-08-05 23:47:29

by NeilBrown

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Mon, Aug 05 2019, Jinpu Wang wrote:

> Hi Neil,
>
> For the md higher write IO latency problem, I bisected it to these commits:
>
> 4ad23a97 MD: use per-cpu counter for writes_pending
> 210f7cd percpu-refcount: support synchronous switch to atomic mode.
>
> Do you maybe have an idea? How can we fix it?

Hmmm.... not sure.

My guess is that the set_in_sync() call from md_check_recovery()
is taking a long time, and is being called too often.

Could you try two experiments please.

1/ set /sys/block/md0/md/safe_mode_delay
to 20 or more. It defaults to about 0.2.

2/ comment out the call to set_in_sync() in md_check_recovery().

Then run the test separately after each of these changes.
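
For 1/, that is just a sysfs write, something like:

  echo 20 > /sys/block/md0/md/safe_mode_delay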

If the second one makes a difference, I'd like to know how often it gets
called - and why. The test
	if ( ! (
		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
		(mddev->external == 0 && mddev->safemode == 1) ||
		(mddev->safemode == 2
		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
		))
		return;

should normally return when doing lots of IO - I'd like to know
which condition causes it to not return.

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2019-08-06 04:58:33

by riccardofarabia

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

Hello Linux raid folks?

I've joined this email ring to ask for help. (I'd be happy to try to
help in return, but I doubt that I could help anyone with advanced raid
issues. Maybe I can return any love in other non-raid ways? Ah well.)
I've been using linux raid for quite a few years, but until now, I
haven't had to rescue data from an array.

I have some data on a 5-disk, raid 10 array that was set up as the os
for a fedora 28 box that I used as both a hypervisor and workstation /
desktop. I set that up with anaconda, the fedora installer; I set the
raid level to 10 and used lvm2 and xfs as the file system. So there is
one lvm volume group on the physical raid, and that group has several
partitions for /root, /var, swap, etc.

I would ordinarily have completed back-ups (actually I usually use NAS
for all data and never keep data locally) and use the disks as door
stops, but I wouldn't mind getting some pictures and videos back that
are on this array. Also there's a current vm on there that I'd like to
get back if I can. I had to rebuild this box because the raid array
seemed to have failed and the last box wouldn't boot.

Can I ask for help from this group?

I have attached the drives to another box, the one that I'm using now.
It is running fedora 30 (nice os / distro). I can see the drives, and I
can start the array with 4 of 5 disks, using mdadm --assemble and the
devices or the UUID, but I can't seem to mount the array to see the
data.

I've checked all the disks with the SMART test routine of the gnome
disks gui utility. A couple have bad sectors, but they are all labeled
as OKAY by the SMART test.

I'm working through the RAID Recovery pages from the wiki that pointed
me to this email ring: https://raid.wiki.kernel.org/index.php/RAID_Recovery

Thanks for considering!

Rick Nilsson

On Tue, 2019-08-06 at 09:46 +1000, NeilBrown wrote:
> On Mon, Aug 05 2019, Jinpu Wang wrote:
>
> > Hi Neil,
> >
> > For the md higher write IO latency problem, I bisected it to these
> > commits:
> >
> > 4ad23a97 MD: use per-cpu counter for writes_pending
> > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> >
> > Do you maybe have an idea? How can we fix it?
>
> Hmmm.... not sure.
>
> My guess is that the set_in_sync() call from md_check_recovery()
> is taking a long time, and is being called too often.
>
> Could you try two experiments please.
>
> 1/ set /sys/block/md0/md/safe_mode_delay
> to 20 or more. It defaults to about 0.2.
>
> 2/ comment out the call the set_in_sync() in md_check_recovery().
>
> Then run the least separately after each of these changes.
>
> I the second one makes a difference, I'd like to know how often it
> gets
> called - and why. The test
> if ( ! (
> (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
> test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> (mddev->external == 0 && mddev->safemode == 1) ||
> (mddev->safemode == 2
> && !mddev->in_sync && mddev->recovery_cp == MaxSector)
> ))
> return;
>
> should normally return when doing lots of IO - I'd like to know
> which condition causes it to not return.
>
> Thanks,
> NeilBrown

2019-08-06 07:05:22

by Paul Menzel

Subject: Unrelated question and threading (was: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1)

Dear Rick,


It looks like your message is unrelated to the thread at hand.
Therefore, please start a new thread by *not* using the reply feature,
but by creating a new message in your mail program (MUA).

Please read up on mailing list etiquette on the Web, for example [1].


Kind regards,

Paul


[1]: https://wiki.openstack.org/wiki/MailingListEtiquette

2019-08-06 07:55:50

by Jinpu Wang

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
>
> On Mon, Aug 05 2019, Jinpu Wang wrote:
>
> > Hi Neil,
> >
> > For the md higher write IO latency problem, I bisected it to these commits:
> >
> > 4ad23a97 MD: use per-cpu counter for writes_pending
> > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> >
> > Do you maybe have an idea? How can we fix it?
>
> Hmmm.... not sure.
Hi Neil,

Thanks for the reply, detailed results inline.
>
> My guess is that the set_in_sync() call from md_check_recovery()
> is taking a long time, and is being called too often.
>
> Could you try two experiments please.
>
Baseline on 5.3-rc3:
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092003.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2621: Tue Aug 6 09:20:44 2019
write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
slat (usec): min=2, max=69992, avg= 5.37, stdev=374.95
clat (usec): min=0, max=147, avg= 2.42, stdev=13.57
lat (usec): min=2, max=70079, avg= 7.84, stdev=376.07
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
| 99.00th=[ 96], 99.50th=[ 125], 99.90th=[ 137], 99.95th=[ 139],
| 99.99th=[ 141]
bw (KB /s): min=18454, max=21608, per=100.00%, avg=20005.15, stdev=352.24
lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%
lat (usec) : 100=0.46%, 250=0.94%
cpu : usr=4.64%, sys=0.00%, ctx=197118, majf=0, minf=11
IO depths : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
md0: ios=60/199436, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


> 1/ set /sys/block/md0/md/safe_mode_delay
> to 20 or more. It defaults to about 0.2.
Setting only safe_mode_delay to 20 already gives a nice improvement:
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092144.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2676: Tue Aug 6 09:22:25 2019
write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
slat (usec): min=2, max=99490, avg= 2.98, stdev=222.46
clat (usec): min=0, max=103, avg= 0.96, stdev= 4.51
lat (usec): min=2, max=99581, avg= 3.99, stdev=222.71
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
| 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 90], 99.95th=[ 91],
| 99.99th=[ 95]
bw (KB /s): min=20000, max=20008, per=100.00%, avg=20001.82, stdev= 3.38
lat (usec) : 2=99.72%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.25%, 250=0.01%
cpu : usr=3.17%, sys=1.48%, ctx=199470, majf=0, minf=11
IO depths : 1=99.7%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
md0: ios=60/199461, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


>
> 2/ comment out the call the set_in_sync() in md_check_recovery().
Only commenting out set_in_sync() gives an even better improvement:
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093340.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2626: Tue Aug 6 09:34:20 2019
write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
slat (usec): min=2, max=29, avg= 2.49, stdev= 0.72
clat (usec): min=0, max=101, avg= 0.78, stdev= 1.17
lat (usec): min=2, max=117, avg= 3.34, stdev= 1.25
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
| 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1],
| 99.99th=[ 72]
bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.03, stdev= 3.50
lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.02%, 250=0.01%
cpu : usr=4.17%, sys=0.00%, ctx=199951, majf=0, minf=12
IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
md0: ios=60/199435, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

With both applied
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093916.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2709: Tue Aug 6 09:39:57 2019
write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
slat (usec): min=2, max=16, avg= 2.46, stdev= 0.69
clat (usec): min=0, max=100, avg= 0.61, stdev= 1.18
lat (usec): min=2, max=104, avg= 3.12, stdev= 1.33
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
| 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1],
| 99.99th=[ 70]
bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.73, stdev= 3.82
lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.02%, 250=0.01%
cpu : usr=3.33%, sys=1.31%, ctx=199941, majf=0, minf=12
IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
md0: ios=60/199460, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

>
> Then run the least separately after each of these changes.
>


> I the second one makes a difference, I'd like to know how often it gets
> called - and why. The test
> if ( ! (
> (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
> test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> (mddev->external == 0 && mddev->safemode == 1) ||
> (mddev->safemode == 2
> && !mddev->in_sync && mddev->recovery_cp == MaxSector)
> ))
> return;
>
> should normally return when doing lots of IO - I'd like to know
> which condition causes it to not return.
I will check, and report later today.
Thanks again!

Jack Wang
>
> Thanks,
> NeilBrown

2019-08-06 12:00:23

by Jinpu Wang

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
>
> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
> >
> > On Mon, Aug 05 2019, Jinpu Wang wrote:
> >
> > > Hi Neil,
> > >
> > > For the md higher write IO latency problem, I bisected it to these commits:
> > >
> > > 4ad23a97 MD: use per-cpu counter for writes_pending
> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> > >
> > > Do you maybe have an idea? How can we fix it?
> >
> > Hmmm.... not sure.
> Hi Neil,
>
> Thanks for reply, detailed result in line.
> >
> > My guess is that the set_in_sync() call from md_check_recovery()
> > is taking a long time, and is being called too often.
> >
> > Could you try two experiments please.
> >
> Baseline on 5.3-rc3:
> root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092003.log
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
>
> write-test: (groupid=0, jobs=1): err= 0: pid=2621: Tue Aug 6 09:20:44 2019
> write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
> slat (usec): min=2, max=69992, avg= 5.37, stdev=374.95
> clat (usec): min=0, max=147, avg= 2.42, stdev=13.57
> lat (usec): min=2, max=70079, avg= 7.84, stdev=376.07
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1],
> | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
> | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
> | 99.00th=[ 96], 99.50th=[ 125], 99.90th=[ 137], 99.95th=[ 139],
> | 99.99th=[ 141]
> bw (KB /s): min=18454, max=21608, per=100.00%, avg=20005.15, stdev=352.24
> lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%
> lat (usec) : 100=0.46%, 250=0.94%
> cpu : usr=4.64%, sys=0.00%, ctx=197118, majf=0, minf=11
> IO depths : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> mint=40001msec, maxt=40001msec
>
> Disk stats (read/write):
> md0: ios=60/199436, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
>
> > 1/ set /sys/block/md0/md/safe_mode_delay
> > to 20 or more. It defaults to about 0.2.
> only set 20 to safe_mode_delay, give a nice improvement.
> root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092144.log
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
>
> write-test: (groupid=0, jobs=1): err= 0: pid=2676: Tue Aug 6 09:22:25 2019
> write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
> slat (usec): min=2, max=99490, avg= 2.98, stdev=222.46
> clat (usec): min=0, max=103, avg= 0.96, stdev= 4.51
> lat (usec): min=2, max=99581, avg= 3.99, stdev=222.71
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
> | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
> | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
> | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 90], 99.95th=[ 91],
> | 99.99th=[ 95]
> bw (KB /s): min=20000, max=20008, per=100.00%, avg=20001.82, stdev= 3.38
> lat (usec) : 2=99.72%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.25%, 250=0.01%
> cpu : usr=3.17%, sys=1.48%, ctx=199470, majf=0, minf=11
> IO depths : 1=99.7%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> mint=40001msec, maxt=40001msec
>
> Disk stats (read/write):
> md0: ios=60/199461, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
>
> >
> > 2/ comment out the call the set_in_sync() in md_check_recovery().
> Only commented out set_in_sync get a better improvement
> root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093340.log
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
>
> write-test: (groupid=0, jobs=1): err= 0: pid=2626: Tue Aug 6 09:34:20 2019
> write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
> slat (usec): min=2, max=29, avg= 2.49, stdev= 0.72
> clat (usec): min=0, max=101, avg= 0.78, stdev= 1.17
> lat (usec): min=2, max=117, avg= 3.34, stdev= 1.25
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
> | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
> | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
> | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1],
> | 99.99th=[ 72]
> bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.03, stdev= 3.50
> lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.02%, 250=0.01%
> cpu : usr=4.17%, sys=0.00%, ctx=199951, majf=0, minf=12
> IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> mint=40001msec, maxt=40001msec
>
> Disk stats (read/write):
> md0: ios=60/199435, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> With both applied
> root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093916.log
> write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> fio-2.2.10
> Starting 1 process
>
> write-test: (groupid=0, jobs=1): err= 0: pid=2709: Tue Aug 6 09:39:57 2019
> write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
> slat (usec): min=2, max=16, avg= 2.46, stdev= 0.69
> clat (usec): min=0, max=100, avg= 0.61, stdev= 1.18
> lat (usec): min=2, max=104, avg= 3.12, stdev= 1.33
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
> | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 1], 60.00th=[ 1],
> | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1],
> | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1],
> | 99.99th=[ 70]
> bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.73, stdev= 3.82
> lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.02%, 250=0.01%
> cpu : usr=3.33%, sys=1.31%, ctx=199941, majf=0, minf=12
> IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
> WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
> mint=40001msec, maxt=40001msec
>
> Disk stats (read/write):
> md0: ios=60/199460, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
> aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
> aggrutil=0.00%
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> >
> > Then run the least separately after each of these changes.
> >
>
>
> > I the second one makes a difference, I'd like to know how often it gets
> > called - and why. The test
> > if ( ! (
> > (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
> > test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> > (mddev->external == 0 && mddev->safemode == 1) ||
> > (mddev->safemode == 2
> > && !mddev->in_sync && mddev->recovery_cp == MaxSector)
> > ))
> > return;
> >
I added the following debug message:
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2404,6 +2404,8 @@ static void export_array(struct mddev *mddev)
 static bool set_in_sync(struct mddev *mddev)
 {
 	lockdep_assert_held(&mddev->lock);
+	pr_info("md %s in_sync is %d, sb_flags %ld, recovery %ld, external %d, safemode %d, recovery_cp %llu\n", mdname(mddev), mddev->in_sync, mddev->sb_flags, mddev->recovery, mddev->external, mddev->safemode, (unsigned long long)mddev->recovery_cp);
+
 	if (!mddev->in_sync) {
 		mddev->sync_checkers++;
 		spin_unlock(&mddev->lock);
@@ -2411,6 +2413,7 @@ static bool set_in_sync(struct mddev *mddev)
 		spin_lock(&mddev->lock);
 		if (!mddev->in_sync &&
 		    percpu_ref_is_zero(&mddev->writes_pending)) {
+			pr_info("set_in_sync\n");
 			mddev->in_sync = 1;
 			/*
 			 * Ensure ->in_sync is visible before we clear

The relevant dmesg:
[ 103.976374] md/raid1:md0: not clean -- starting background reconstruction
[ 103.976479] md/raid1:md0: active with 2 out of 2 mirrors
[ 103.976597] md0: detected capacity change from 0 to 535822336
[ 103.976721] md: resync of RAID array md0
[ 104.427369] md md0 in_sync is 0, sb_flags 6, recovery 3, external
0, safemode 0, recovery_cp 393216
[ 105.052186] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 458752
[ 105.127718] set_in_sync
[ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 524288
[ 105.207723] set_in_sync
[ 105.213318] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 589824
[ 105.287717] set_in_sync
[ 105.293299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 655360
[ 105.347718] set_in_sync
[ 105.353267] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 720896
[ 105.397717] set_in_sync
[ 105.403367] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 786432
[ 105.457718] set_in_sync
[ 105.981324] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 851968
[ 106.017718] set_in_sync
[ 106.022784] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 917504
[ 106.057718] set_in_sync
[ 106.063357] md md0 in_sync is 0, sb_flags 2, recovery 3, external
0, safemode 0, recovery_cp 983040
[ 106.117718] set_in_sync
[ 106.122851] md: md0: resync done.
[ 106.123261] md md0 in_sync is 0, sb_flags 5, recovery 19, external
0, safemode 0, recovery_cp 18446744073709551615
[ 106.157741] md md0 in_sync is 0, sb_flags 0, recovery 32, external
0, safemode 0, recovery_cp 18446744073709551615
[ 106.217718] set_in_sync
[ 144.707722] md md0 in_sync is 0, sb_flags 0, recovery 0, external
0, safemode 0, recovery_cp 18446744073709551615
[ 144.767720] set_in_sync

Please let me know if you need further information, I can also test
any draft patch if needed.

Thanks,
Jack Wang




> > should normally return when doing lots of IO - I'd like to know
> > which condition causes it to not return.
> I will check, and report later today.
> Thanks again!
>
> Jack Wang
> >
> > Thanks,
> > NeilBrown

2019-08-06 23:41:28

by NeilBrown

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Tue, Aug 06 2019, Jinpu Wang wrote:

> On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
>>
>> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
>> >
>> > On Mon, Aug 05 2019, Jinpu Wang wrote:
>> >
>> > > Hi Neil,
>> > >
>> > > For the md higher write IO latency problem, I bisected it to these commits:
>> > >
>> > > 4ad23a97 MD: use per-cpu counter for writes_pending
>> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
>> > >
>> > > Do you maybe have an idea? How can we fix it?
>> >
>> > Hmmm.... not sure.
>> Hi Neil,
>>
>> Thanks for reply, detailed result in line.

Thanks for the extra testing.
...
> [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
> 0, safemode 0, recovery_cp 524288
...

ahh - the resync was still happening. That explains why set_in_sync()
is being called so often. If you wait for sync to complete (or create
the array with --assume-clean) you should see more normal behaviour.
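
(For your ramdisk reproducer that would be something like

  mdadm -C /dev/md0 -l1 -n2 -e1.2 --assume-clean --bitmap=internal /dev/ram[0-1]

or simply waiting for the resync with "mdadm --wait /dev/md0" / watching
/proc/mdstat before starting fio - untested, but you get the idea.)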

This patch should fix it. I think we can do better, but it would be more
complex and so not suitable for backports to -stable.

Once you confirm it works, I'll send it upstream with a
Reported-and-Tested-by from you.

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index 24638ccedce4..624cf1ac43dc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8900,6 +8900,7 @@ void md_check_recovery(struct mddev *mddev)
 
 	if (mddev_trylock(mddev)) {
 		int spares = 0;
+		bool try_set_sync = mddev->safemode != 0;
 
 		if (!mddev->external && mddev->safemode == 1)
 			mddev->safemode = 0;
@@ -8945,7 +8946,7 @@ void md_check_recovery(struct mddev *mddev)
 			}
 		}
 
-		if (!mddev->external && !mddev->in_sync) {
+		if (try_set_sync && !mddev->external && !mddev->in_sync) {
 			spin_lock(&mddev->lock);
 			set_in_sync(mddev);
 			spin_unlock(&mddev->lock);


Attachments:
signature.asc (847.00 B)

2019-08-07 06:38:33

by Jinpu Wang

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Wed, Aug 7, 2019 at 1:40 AM NeilBrown <[email protected]> wrote:
>
> On Tue, Aug 06 2019, Jinpu Wang wrote:
>
> > On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
> >>
> >> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
> >> >
> >> > On Mon, Aug 05 2019, Jinpu Wang wrote:
> >> >
> >> > > Hi Neil,
> >> > >
> >> > > For the md higher write IO latency problem, I bisected it to these commits:
> >> > >
> >> > > 4ad23a97 MD: use per-cpu counter for writes_pending
> >> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> >> > >
> >> > > Do you maybe have an idea? How can we fix it?
> >> >
> >> > Hmmm.... not sure.
> >> Hi Neil,
> >>
> >> Thanks for reply, detailed result in line.
>
> Thanks for the extra testing.
> ...
> > [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
> > 0, safemode 0, recovery_cp 524288
> ...
>
> ahh - the resync was still happening. That explains why set_in_sync()
> is being called so often. If you wait for sync to complete (or create
> the array with --assume-clean) you should see more normal behaviour.
I've updated my tests accordingly, thanks for the hint.
>
> This patch should fix it. I think we can do better but it would be more
> complex so no suitable for backports to -stable.
>
> Once you confirm it works, I'll send it upstream with a
> Reported-and-Tested-by from you.
>
> Thanks,
> NeilBrown

Thanks a lot, Neil, my quick test shows that, yes, it fixed the problem for me.

I will run more tests to be sure, will report back the test result.

Regards,
Jack Wang

>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 24638ccedce4..624cf1ac43dc 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8900,6 +8900,7 @@ void md_check_recovery(struct mddev *mddev)
>
> if (mddev_trylock(mddev)) {
> int spares = 0;
> + bool try_set_sync = mddev->safemode != 0;
>
> if (!mddev->external && mddev->safemode == 1)
> mddev->safemode = 0;
> @@ -8945,7 +8946,7 @@ void md_check_recovery(struct mddev *mddev)
> }
> }
>
> - if (!mddev->external && !mddev->in_sync) {
> + if (try_set_sync && !mddev->external && !mddev->in_sync) {
> spin_lock(&mddev->lock);
> set_in_sync(mddev);
> spin_unlock(&mddev->lock);

2019-08-07 12:36:27

by Jinpu Wang

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Wed, Aug 7, 2019 at 8:36 AM Jinpu Wang <[email protected]> wrote:
>
> On Wed, Aug 7, 2019 at 1:40 AM NeilBrown <[email protected]> wrote:
> >
> > On Tue, Aug 06 2019, Jinpu Wang wrote:
> >
> > > On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
> > >>
> > >> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
> > >> >
> > >> > On Mon, Aug 05 2019, Jinpu Wang wrote:
> > >> >
> > >> > > Hi Neil,
> > >> > >
> > >> > > For the md higher write IO latency problem, I bisected it to these commits:
> > >> > >
> > >> > > 4ad23a97 MD: use per-cpu counter for writes_pending
> > >> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> > >> > >
> > >> > > Do you maybe have an idea? How can we fix it?
> > >> >
> > >> > Hmmm.... not sure.
> > >> Hi Neil,
> > >>
> > >> Thanks for reply, detailed result in line.
> >
> > Thanks for the extra testing.
> > ...
> > > [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
> > > 0, safemode 0, recovery_cp 524288
> > ...
> >
> > ahh - the resync was still happening. That explains why set_in_sync()
> > is being called so often. If you wait for sync to complete (or create
> > the array with --assume-clean) you should see more normal behaviour.
> I've updated my tests accordingly, thanks for the hint.
> >
> > This patch should fix it. I think we can do better but it would be more
> > complex so no suitable for backports to -stable.
> >
> > Once you confirm it works, I'll send it upstream with a
> > Reported-and-Tested-by from you.
> >
> > Thanks,
> > NeilBrown
>
> Thanks a lot, Neil, my quick test show, yes, it fixed the problem for me.
>
> I will run more tests to be sure, will report back the test result.
Hi Neil,

I've run our regression tests with your patch; everything works fine
as expected.

So Reported-and-Tested-by: Jack Wang <[email protected]>

Thank you for your quick fix.

The patch should go to stable 4.12+
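(i.e. with something like "Cc: <[email protected]> # 4.12.x" in the tags.)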

Regards,
Jack Wang

>
> Regards,
> Jack Wang
>
> >
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 24638ccedce4..624cf1ac43dc 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -8900,6 +8900,7 @@ void md_check_recovery(struct mddev *mddev)
> >
> > if (mddev_trylock(mddev)) {
> > int spares = 0;
> > + bool try_set_sync = mddev->safemode != 0;
> >
> > if (!mddev->external && mddev->safemode == 1)
> > mddev->safemode = 0;
> > @@ -8945,7 +8946,7 @@ void md_check_recovery(struct mddev *mddev)
> > }
> > }
> >
> > - if (!mddev->external && !mddev->in_sync) {
> > + if (try_set_sync && !mddev->external && !mddev->in_sync) {
> > spin_lock(&mddev->lock);
> > set_in_sync(mddev);
> > spin_unlock(&mddev->lock);

2019-08-16 08:13:42

by Jinpu Wang

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Wed, Aug 7, 2019 at 2:35 PM Jinpu Wang <[email protected]> wrote:
>
> On Wed, Aug 7, 2019 at 8:36 AM Jinpu Wang <[email protected]> wrote:
> >
> > On Wed, Aug 7, 2019 at 1:40 AM NeilBrown <[email protected]> wrote:
> > >
> > > On Tue, Aug 06 2019, Jinpu Wang wrote:
> > >
> > > > On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
> > > >>
> > > >> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
> > > >> >
> > > >> > On Mon, Aug 05 2019, Jinpu Wang wrote:
> > > >> >
> > > >> > > Hi Neil,
> > > >> > >
> > > >> > > For the md higher write IO latency problem, I bisected it to these commits:
> > > >> > >
> > > >> > > 4ad23a97 MD: use per-cpu counter for writes_pending
> > > >> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
> > > >> > >
> > > >> > > Do you maybe have an idea? How can we fix it?
> > > >> >
> > > >> > Hmmm.... not sure.
> > > >> Hi Neil,
> > > >>
> > > >> Thanks for reply, detailed result in line.
> > >
> > > Thanks for the extra testing.
> > > ...
> > > > [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
> > > > 0, safemode 0, recovery_cp 524288
> > > ...
> > >
> > > ahh - the resync was still happening. That explains why set_in_sync()
> > > is being called so often. If you wait for sync to complete (or create
> > > the array with --assume-clean) you should see more normal behaviour.
> > I've updated my tests accordingly, thanks for the hint.
> > >
> > > This patch should fix it. I think we can do better but it would be more
> > > complex so no suitable for backports to -stable.
> > >
> > > Once you confirm it works, I'll send it upstream with a
> > > Reported-and-Tested-by from you.
> > >
> > > Thanks,
> > > NeilBrown
> >
> > Thanks a lot, Neil, my quick test show, yes, it fixed the problem for me.
> >
> > I will run more tests to be sure, will report back the test result.
> Hi Neil,
>
> I've run our regression tests with your patch, everything works fine
> as expected.
>
> So Reported-and-Tested-by: Jack Wang <[email protected]>
>
> Thank you for your quick fix.
>
> The patch should go to stable 4.12+

Hi Neil,

I hope you're doing well, just a soft ping: do you need further
testing from my side?

Please let me know how we can move the fix forward.

Thanks,
Jack Wang

2019-08-20 00:29:13

by NeilBrown

Subject: Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

On Fri, Aug 16 2019, Jinpu Wang wrote:

> On Wed, Aug 7, 2019 at 2:35 PM Jinpu Wang <[email protected]> wrote:
>>
>> On Wed, Aug 7, 2019 at 8:36 AM Jinpu Wang <[email protected]> wrote:
>> >
>> > On Wed, Aug 7, 2019 at 1:40 AM NeilBrown <[email protected]> wrote:
>> > >
>> > > On Tue, Aug 06 2019, Jinpu Wang wrote:
>> > >
>> > > > On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang <[email protected]> wrote:
>> > > >>
>> > > >> On Tue, Aug 6, 2019 at 1:46 AM NeilBrown <[email protected]> wrote:
>> > > >> >
>> > > >> > On Mon, Aug 05 2019, Jinpu Wang wrote:
>> > > >> >
>> > > >> > > Hi Neil,
>> > > >> > >
>> > > >> > > For the md higher write IO latency problem, I bisected it to these commits:
>> > > >> > >
>> > > >> > > 4ad23a97 MD: use per-cpu counter for writes_pending
>> > > >> > > 210f7cd percpu-refcount: support synchronous switch to atomic mode.
>> > > >> > >
>> > > >> > > Do you maybe have an idea? How can we fix it?
>> > > >> >
>> > > >> > Hmmm.... not sure.
>> > > >> Hi Neil,
>> > > >>
>> > > >> Thanks for reply, detailed result in line.
>> > >
>> > > Thanks for the extra testing.
>> > > ...
>> > > > [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external
>> > > > 0, safemode 0, recovery_cp 524288
>> > > ...
>> > >
>> > > ahh - the resync was still happening. That explains why set_in_sync()
>> > > is being called so often. If you wait for sync to complete (or create
>> > > the array with --assume-clean) you should see more normal behaviour.
>> > I've updated my tests accordingly, thanks for the hint.
>> > >
>> > > This patch should fix it. I think we can do better but it would be more
>> > > complex so no suitable for backports to -stable.
>> > >
>> > > Once you confirm it works, I'll send it upstream with a
>> > > Reported-and-Tested-by from you.
>> > >
>> > > Thanks,
>> > > NeilBrown
>> >
>> > Thanks a lot, Neil, my quick test show, yes, it fixed the problem for me.
>> >
>> > I will run more tests to be sure, will report back the test result.
>> Hi Neil,
>>
>> I've run our regression tests with your patch, everything works fine
>> as expected.
>>
>> So Reported-and-Tested-by: Jack Wang <[email protected]>
>>
>> Thank you for your quick fix.
>>
>> The patch should go to stable 4.12+
>
> Hi Neil,
>
> I hope you're doing well, just a soft ping? do you need further
> testing from my side?

Thanks for the reminder. I've sent out the patch now.

NeilBrown


>
> Please let me know how can we move the fix forward.
>
> Thanks,
> Jack Wang


Attachments:
signature.asc (847.00 B)