LinuxLists.cc - A huge latency in ext4 and xfs because of stable page write

2012-12-11 08:45:21

Subject: A huge latency in ext4 and xfs because of stable page write

Hi all,

At Tao Bao we meet a problem in our product system which causes a huge latency
because of stable page write. This problem is easy to reproduce in a testing
environment, and I can reproduce it in my desktop with a SATA disk. Here is the
fio config file that is used to reproduce this problem.

config file
-----------
[global]
iodepth=1
directory=/mnt/sda3
direct=0
group_reporting
thread
fallocate=0
runtime=120

[log-append]
ioengine=sync
rw=write
bs=1k
size=10g
filesize=10g
rate=5m
numjobs=1

I run this test case in ext4 and xfs, and both of them are affected by stable
page write. The result shows as below. Please notice the result of latency.

ext4 w/ stable page write
-------------------------
log-append: (groupid=0, jobs=1): err= 0: pid=3455: Tue Dec 11 14:41:27 2012
write: io=614401KB, bw=5119.1KB/s, iops=5119 , runt=120001msec
clat (usec): min=3 , max=225702 , avg= 7.32, stdev=411.95
lat (usec): min=3 , max=225702 , avg= 7.50, stdev=411.95
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 4], 60.00th=[ 5],
| 70.00th=[ 6], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14],
| 99.00th=[ 15], 99.50th=[ 16], 99.90th=[ 18], 99.95th=[ 19],
| 99.99th=[ 25]
bw (KB/s) : min= 5108, max= 5134, per=100.00%, avg=5120.92, stdev= 1.58
lat (usec) : 4=15.93%, 10=59.05%, 20=24.98%, 50=0.03%, 100=0.01%
lat (usec) : 250=0.01%
lat (msec) : 20=0.01%, 250=0.01%
cpu : usr=0.18%, sys=4.31%, ctx=118850, majf=18446744073709551605,
minf=18446744073709538970
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=614401/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=614401KB, aggrb=5119KB/s, minb=5119KB/s, maxb=5119KB/s,
mint=120001msec, maxt=120001msec

Disk stats (read/write):
sda: ios=79/1287, merge=5/151907, ticks=702/147202, in_queue=147861, util=6.25%

ext4 w/o stable page write
--------------------------
log-append: (groupid=0, jobs=1): err= 0: pid=3062: Tue Dec 11 14:52:32 2012
write: io=614400KB, bw=5119.1KB/s, iops=5119 , runt=120001msec
clat (usec): min=3 , max=14565 , avg= 6.15, stdev=25.35
lat (usec): min=3 , max=14566 , avg= 6.29, stdev=25.35
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 4], 60.00th=[ 4],
| 70.00th=[ 5], 80.00th=[ 12], 90.00th=[ 12], 95.00th=[ 13],
| 99.00th=[ 15], 99.50th=[ 15], 99.90th=[ 17], 99.95th=[ 18],
| 99.99th=[ 64]
bw (KB/s) : min= 5108, max= 5140, per=100.00%, avg=5122.26, stdev= 4.54
lat (usec) : 4=17.33%, 10=57.65%, 20=24.98%, 50=0.02%, 100=0.01%
lat (usec) : 250=0.01%
lat (msec) : 2=0.01%, 20=0.01%
cpu : usr=4.12%, sys=0.00%, ctx=58904, majf=18446744073709551605,
minf=18446744073709538969
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=614400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=614400KB, aggrb=5119KB/s, minb=5119KB/s, maxb=5119KB/s,
mint=120001msec, maxt=120001msec

Disk stats (read/write):
sda: ios=34/1254, merge=4/152032, ticks=157/144066, in_queue=147725, util=5.82%

xfs w/ stable page write
------------------------
log-append: (groupid=0, jobs=1): err= 0: pid=2848: Tue Dec 11 15:39:25 2012
write: io=614401KB, bw=5119.1KB/s, iops=5119 , runt=120002msec
clat (usec): min=1 , max=635643 , avg= 5.02, stdev=1141.37
lat (usec): min=1 , max=635643 , avg= 5.17, stdev=1141.37
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
| 70.00th=[ 4], 80.00th=[ 5], 90.00th=[ 6], 95.00th=[ 6],
| 99.00th=[ 9], 99.50th=[ 10], 99.90th=[ 11], 99.95th=[ 12],
| 99.99th=[ 13]
bw (KB/s) : min= 1388, max=11632, per=100.00%, avg=5147.03, stdev=675.82
lat (usec) : 2=21.76%, 4=46.03%, 10=31.65%, 20=0.56%, 50=0.01%
lat (msec) : 20=0.01%, 750=0.01%
cpu : usr=1.09%, sys=1.18%, ctx=58250, majf=18446744073709551605,
minf=18446744073709538970
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=614401/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=614401KB, aggrb=5119KB/s, minb=5119KB/s, maxb=5119KB/s,
mint=120002msec, maxt=120002msec

Disk stats (read/write):
sda: ios=15/1101, merge=0/28, ticks=150/554585, in_queue=554731, util=4.18%

xfs w/o stable page write
-------------------------
log-append: (groupid=0, jobs=1): err= 0: pid=3678: Tue Dec 11 15:01:10 2012
write: io=614401KB, bw=5119.1KB/s, iops=5119 , runt=120001msec
clat (usec): min=1 , max=20866 , avg= 3.50, stdev=26.72
lat (usec): min=1 , max=20866 , avg= 3.60, stdev=26.72
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 2],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
| 70.00th=[ 5], 80.00th=[ 6], 90.00th=[ 6], 95.00th=[ 8],
| 99.00th=[ 10], 99.50th=[ 10], 99.90th=[ 12], 99.95th=[ 12],
| 99.99th=[ 14]
bw (KB/s) : min= 5110, max= 5132, per=100.00%, avg=5120.84, stdev= 2.25
lat (usec) : 2=15.33%, 4=44.93%, 10=38.33%, 20=1.41%, 50=0.01%
lat (usec) : 500=0.01%
lat (msec) : 50=0.01%
cpu : usr=2.83%, sys=0.00%, ctx=119423, majf=18446744073709551605,
minf=18446744073709538968
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=614401/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
WRITE: io=614401KB, aggrb=5119KB/s, minb=5119KB/s, maxb=5119KB/s,
mint=120001msec, maxt=120001msec

Disk stats (read/write):
sda: ios=106/1055, merge=3/35, ticks=1042/522739, in_queue=523780, util=4.56%

We can see that the max of latency is very high with stable page write in ext4
and xfs, and it reduces to a lower value after stable page write is reverted. I
summarize as below.

ext4 xfs (usec, lower is better)
w/ 225702 635643
w/o 14565 20866

Hence, I wonder whether or not we could revert stable page write temporarily.
After it is improved, we could add it back again.

Thanks,
- Zheng

2012-12-11 11:17:11

by Dave Chinner

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Tue, Dec 11, 2012 at 04:45:21PM +0800, Zheng Liu wrote:
> Hi all,
>
> At Tao Bao we meet a problem in our product system which causes a huge latency
> because of stable page write. This problem is easy to reproduce in a testing
> environment, and I can reproduce it in my desktop with a SATA disk. Here is the
> fio config file that is used to reproduce this problem.
>
> config file
> -----------
> [global]
> iodepth=1
> directory=/mnt/sda3
> direct=0
> group_reporting
> thread
> fallocate=0
> runtime=120
>
> [log-append]
> ioengine=sync
> rw=write
> bs=1k

Sub-page sized IO. That's guaranteed to have noticable IO latency
anomalies, regardless of stable data pages. If you are just
doing appending writes, then you can easily buffer them till you
have a page of data to write and avoid the problem altogether.

> Hence, I wonder whether or not we could revert stable page write temporarily.
> After it is improved, we could add it back again.

The plan is to turn it off for filesystems/devices that don't
require it. That list of devices will grow in future, so you
probably should plan to handle latencies in the application
properly...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-12-12 04:18:03

by Zheng Liu

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Tue, Dec 11, 2012 at 10:17:07PM +1100, Dave Chinner wrote:
> On Tue, Dec 11, 2012 at 04:45:21PM +0800, Zheng Liu wrote:
> > Hi all,
> >
> > At Tao Bao we meet a problem in our product system which causes a huge latency
> > because of stable page write. This problem is easy to reproduce in a testing
> > environment, and I can reproduce it in my desktop with a SATA disk. Here is the
> > fio config file that is used to reproduce this problem.
> >
> > config file
> > -----------
> > [global]
> > iodepth=1
> > directory=/mnt/sda3
> > direct=0
> > group_reporting
> > thread
> > fallocate=0
> > runtime=120
> >
> > [log-append]
> > ioengine=sync
> > rw=write
> > bs=1k
>
> Sub-page sized IO. That's guaranteed to have noticable IO latency
> anomalies, regardless of stable data pages. If you are just
> doing appending writes, then you can easily buffer them till you
> have a page of data to write and avoid the problem altogether.

Hi Dave,

Thanks for your reply. I agree with you that the application still
could be improved. But, unfortunately, the real life is that we cannot
change the application. We couldn't convince the application developers
that they need to change their program to adapt to the latest kernel
because the latest kernel has a new feature that is not useful for them.
That is unacceptable for them. :-(

>
> > Hence, I wonder whether or not we could revert stable page write temporarily.
> > After it is improved, we could add it back again.
>
> The plan is to turn it off for filesystems/devices that don't
> require it. That list of devices will grow in future, so you
> probably should plan to handle latencies in the application
> properly...

I wonder whether we can provide a sysctl to turn on/off stable page
write. At least we need to give sysadmin an opportunity to control it.

Regards,
- Zheng

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2012-12-12 04:49:49

by Dave Chinner

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Wed, Dec 12, 2012 at 12:18:03PM +0800, Zheng Liu wrote:
> On Tue, Dec 11, 2012 at 10:17:07PM +1100, Dave Chinner wrote:
> > On Tue, Dec 11, 2012 at 04:45:21PM +0800, Zheng Liu wrote:
> > > Hi all,
> > >
> > > At Tao Bao we meet a problem in our product system which causes a huge latency
> > > because of stable page write. This problem is easy to reproduce in a testing
> > > environment, and I can reproduce it in my desktop with a SATA disk. Here is the
> > > fio config file that is used to reproduce this problem.
> > >
> > > config file
> > > -----------
> > > [global]
> > > iodepth=1
> > > directory=/mnt/sda3
> > > direct=0
> > > group_reporting
> > > thread
> > > fallocate=0
> > > runtime=120
> > >
> > > [log-append]
> > > ioengine=sync
> > > rw=write
> > > bs=1k
> >
> > Sub-page sized IO. That's guaranteed to have noticable IO latency
> > anomalies, regardless of stable data pages. If you are just
> > doing appending writes, then you can easily buffer them till you
> > have a page of data to write and avoid the problem altogether.
>
> Hi Dave,
>
> Thanks for your reply. I agree with you that the application still
> could be improved. But, unfortunately, the real life is that we cannot
> change the application. We couldn't convince the application developers
> that they need to change their program to adapt to the latest kernel
> because the latest kernel has a new feature that is not useful for them.
> That is unacceptable for them. :-(

Too bad for them, then. The kernel can't give everyone a pony....

> > > Hence, I wonder whether or not we could revert stable page write temporarily.
> > > After it is improved, we could add it back again.
> >
> > The plan is to turn it off for filesystems/devices that don't
> > require it. That list of devices will grow in future, so you
> > probably should plan to handle latencies in the application
> > properly...
>
> I wonder whether we can provide a sysctl to turn on/off stable page
> write. At least we need to give sysadmin an opportunity to control it.

That's already been considered and discarded because turning off
stable pages on devices that require it will cause validation or
data corruption problems. The discussion was for these patches (and
I think a followup series as well):

http://www.spinics.net/lists/linux-fsdevel/msg59421.html

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-12-12 05:13:32

by Zheng Liu

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Wed, Dec 12, 2012 at 03:49:49PM +1100, Dave Chinner wrote:
[cut...]
> > > > Hence, I wonder whether or not we could revert stable page write temporarily.
> > > > After it is improved, we could add it back again.
> > >
> > > The plan is to turn it off for filesystems/devices that don't
> > > require it. That list of devices will grow in future, so you
> > > probably should plan to handle latencies in the application
> > > properly...
> >
> > I wonder whether we can provide a sysctl to turn on/off stable page
> > write. At least we need to give sysadmin an opportunity to control it.
>
> That's already been considered and discarded because turning off
> stable pages on devices that require it will cause validation or
> data corruption problems. The discussion was for these patches (and
> I think a followup series as well):
>
> http://www.spinics.net/lists/linux-fsdevel/msg59421.html

Thanks for pointing out. So now it seems that only I can do is to
present a proposal to revert stable page write temporarily because it
causes a huge latency for some applications.

Thanks,
- Zheng

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2012-12-12 07:23:43

by Stefan Ring

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

>> http://www.spinics.net/lists/linux-fsdevel/msg59421.html
>
> Thanks for pointing out. So now it seems that only I can do is to
> present a proposal to revert stable page write temporarily because it
> causes a huge latency for some applications.

There has just been a write-up on the topic of stable pages in last
week’s LWN: http://lwn.net/Articles/528031/

The link should become available for non-subscribers tomorrow.

2012-12-12 08:04:03

by Zheng Liu

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

2012-12-12 19:47:10

by Darrick J. Wong

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Wed, Dec 12, 2012 at 01:13:32PM +0800, Zheng Liu wrote:
> On Wed, Dec 12, 2012 at 03:49:49PM +1100, Dave Chinner wrote:
> [cut...]
> > > > > Hence, I wonder whether or not we could revert stable page write temporarily.
> > > > > After it is improved, we could add it back again.
> > > >
> > > > The plan is to turn it off for filesystems/devices that don't
> > > > require it. That list of devices will grow in future, so you
> > > > probably should plan to handle latencies in the application
> > > > properly...
> > >
> > > I wonder whether we can provide a sysctl to turn on/off stable page
> > > write. At least we need to give sysadmin an opportunity to control it.
> >
> > That's already been considered and discarded because turning off
> > stable pages on devices that require it will cause validation or
> > data corruption problems. The discussion was for these patches (and
> > I think a followup series as well):
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg59421.html
>
> Thanks for pointing out. So now it seems that only I can do is to
> present a proposal to revert stable page write temporarily because it
> causes a huge latency for some applications.

Hrm... just to be clear, is your complaint that you have one of these
checksum-happy disks and overwrites are slow on it, or that you have a regular
SATA disk that doesn't require stable pages and you don't want to take the
speed hit?

If it's the second, then let's just push the bdi flag thing upstream. Given
your comments a couple of days ago, I'm pretty sure it's the second, but I
figured I ought to clarify the record. :)

I haven't posted new patches because I've been busy writing a fix for ext3.
It seems to be working, so I'll clean it up and send out a new series.

--D

2012-12-12 23:07:46

by Dave Chinner

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

On Wed, Dec 12, 2012 at 08:23:43AM +0100, Stefan Ring wrote:
> >> http://www.spinics.net/lists/linux-fsdevel/msg59421.html
> >
> > Thanks for pointing out. So now it seems that only I can do is to
> > present a proposal to revert stable page write temporarily because it
> > causes a huge latency for some applications.
>
> There has just been a write-up on the topic of stable pages in last
> week’s LWN: http://lwn.net/Articles/528031/
>
> The link should become available for non-subscribers tomorrow.

This is what subscriber links are for - when a LWN article is
relevant to a specific discussion, a subscriber can use their
discretion to post world readable link to the pay-walled article so
everyone is able to read the article....

This is one of the reasons I really like LWN and the way Jon has set
up subscriptions (and encourage anyone who doesn't subscribe to do
so). The paywall is something I *choose* to participate
in to get articles as quickly as possible, but have the option of
being able to let non-subscribers read articles immediately if *I*
consider it relevant to a discussion. That's definitely the case
here....

That said, a link has aleady been posted, so I don't have to do
anything but cheerlead here. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-12-13 02:17:11

by Zheng Liu

[permalink] [raw]

Subject: Re: A huge latency in ext4 and xfs because of stable page write

Hi Darrick,

On Wed, Dec 12, 2012 at 11:47:10AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 12, 2012 at 01:13:32PM +0800, Zheng Liu wrote:
> > On Wed, Dec 12, 2012 at 03:49:49PM +1100, Dave Chinner wrote:
> > [cut...]
> > > > > > Hence, I wonder whether or not we could revert stable page write temporarily.
> > > > > > After it is improved, we could add it back again.
> > > > >
> > > > > The plan is to turn it off for filesystems/devices that don't
> > > > > require it. That list of devices will grow in future, so you
> > > > > probably should plan to handle latencies in the application
> > > > > properly...
> > > >
> > > > I wonder whether we can provide a sysctl to turn on/off stable page
> > > > write. At least we need to give sysadmin an opportunity to control it.
> > >
> > > That's already been considered and discarded because turning off
> > > stable pages on devices that require it will cause validation or
> > > data corruption problems. The discussion was for these patches (and
> > > I think a followup series as well):
> > >
> > > http://www.spinics.net/lists/linux-fsdevel/msg59421.html
> >
> > Thanks for pointing out. So now it seems that only I can do is to
> > present a proposal to revert stable page write temporarily because it
> > causes a huge latency for some applications.
>
> Hrm... just to be clear, is your complaint that you have one of these
> checksum-happy disks and overwrites are slow on it, or that you have a regular
> SATA disk that doesn't require stable pages and you don't want to take the
> speed hit?
>
> If it's the second, then let's just push the bdi flag thing upstream. Given
> your comments a couple of days ago, I'm pretty sure it's the second, but I
> figured I ought to clarify the record. :)

Yes, it's the second. In our product system we have a large number of
SATA disks that don't require stable pages. So that would be great if
we have a method to turn on/off stable page.

>
> I haven't posted new patches because I've been busy writing a fix for ext3.
> It seems to be working, so I'll clean it up and send out a new series.

Cool! We are looking forward the new patch series. :-)

Regards,
- Zheng