From: dongdong tao <[email protected]>
The current way to calculate the writeback rate only considers dirty
sectors. This usually works fine when fragmentation is not high, but it
yields an unreasonably small rate when very few dirty sectors are spread
over a large number of dirty buckets. In some cases the dirty buckets
can reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not
even reached writeback_percent; the writeback rate then stays at the
minimum value (4k), so all writes get stuck in non-writeback mode
because of the slow writeback.
This patch accelerates the rate in 3 stages with different
aggressiveness: the first stage starts when the dirty bucket percentage
rises above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage
tries to write back the average amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 50)) seconds, the second stage in
(1 / (dirty_buckets_percent - 57)) * 200 milliseconds, and the third
stage in (1 / (dirty_buckets_percent - 64)) * 20 milliseconds.
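A worked example with the default terms (numbers chosen purely for
illustration): with 1024-sector buckets, 60% of buckets dirty and an
average of ~200 dirty sectors per bucket, fragment = 1024 / 200 ≈ 5
(> 3, so the override applies) and the mid stage is used, giving
fp_term = 5 * (60 - 57) = 15 and a proportional hint of
200 * 15 = 3000 sectors/s, i.e. one bucket's worth of dirty data is
written back roughly every 1/15 second.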
Signed-off-by: dongdong tao <[email protected]>
---
drivers/md/bcache/bcache.h | 3 +++
drivers/md/bcache/sysfs.c | 18 ++++++++++++++++
drivers/md/bcache/writeback.c | 39 +++++++++++++++++++++++++++++++++++
drivers/md/bcache/writeback.h | 4 ++++
4 files changed, 64 insertions(+)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d57f48307e6..f8e892208bae 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -385,6 +385,9 @@ struct cached_dev {
unsigned int writeback_rate_update_seconds;
unsigned int writeback_rate_i_term_inverse;
unsigned int writeback_rate_p_term_inverse;
+ unsigned int writeback_rate_fp_term_low;
+ unsigned int writeback_rate_fp_term_mid;
+ unsigned int writeback_rate_fp_term_high;
unsigned int writeback_rate_minimum;
enum stop_on_failure stop_when_cache_set_failed;
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 554e3afc9b68..130df9406171 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -121,6 +121,9 @@ rw_attribute(writeback_rate);
rw_attribute(writeback_rate_update_seconds);
rw_attribute(writeback_rate_i_term_inverse);
rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_fp_term_low);
+rw_attribute(writeback_rate_fp_term_mid);
+rw_attribute(writeback_rate_fp_term_high);
rw_attribute(writeback_rate_minimum);
read_attribute(writeback_rate_debug);
@@ -205,6 +208,9 @@ SHOW(__bch_cached_dev)
var_print(writeback_rate_update_seconds);
var_print(writeback_rate_i_term_inverse);
var_print(writeback_rate_p_term_inverse);
+ var_print(writeback_rate_fp_term_low);
+ var_print(writeback_rate_fp_term_mid);
+ var_print(writeback_rate_fp_term_high);
var_print(writeback_rate_minimum);
if (attr == &sysfs_writeback_rate_debug) {
@@ -331,6 +337,15 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate_p_term_inverse,
dc->writeback_rate_p_term_inverse,
1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_low,
+ dc->writeback_rate_fp_term_low,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_mid,
+ dc->writeback_rate_fp_term_mid,
+ 1, UINT_MAX);
+ sysfs_strtoul_clamp(writeback_rate_fp_term_high,
+ dc->writeback_rate_fp_term_high,
+ 1, UINT_MAX);
sysfs_strtoul_clamp(writeback_rate_minimum,
dc->writeback_rate_minimum,
1, UINT_MAX);
@@ -502,6 +517,9 @@ static struct attribute *bch_cached_dev_files[] = {
&sysfs_writeback_rate_update_seconds,
&sysfs_writeback_rate_i_term_inverse,
&sysfs_writeback_rate_p_term_inverse,
+ &sysfs_writeback_rate_fp_term_low,
+ &sysfs_writeback_rate_fp_term_mid,
+ &sysfs_writeback_rate_fp_term_high,
&sysfs_writeback_rate_minimum,
&sysfs_writeback_rate_debug,
&sysfs_io_errors,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index a129e4d2707c..a21485448e8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -88,6 +88,42 @@ static void __update_writeback_rate(struct cached_dev *dc)
int64_t integral_scaled;
uint32_t new_rate;
+	/*
+	 * We need to consider the number of dirty buckets as well
+	 * when calculating proportional_scaled, otherwise we might
+	 * end up with an unreasonably small writeback rate in a highly
+	 * fragmented situation where very few dirty sectors consume a
+	 * lot of dirty buckets. The worst case is when dirty_data has
+	 * not reached writeback_percent but the dirty buckets have
+	 * reached cutoff_writeback_sync; the rate then stays at the
+	 * minimum, causing writes to get stuck in non-writeback mode.
+	 */
+ struct cache_set *c = dc->disk.c;
+
+ int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
+
+ if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
+ int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
+ int64_t fp_term;
+ int64_t fps;
+
+ if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
+ fp_term = dc->writeback_rate_fp_term_low *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
+ } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
+ fp_term = dc->writeback_rate_fp_term_mid *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
+ } else {
+ fp_term = dc->writeback_rate_fp_term_high *
+ (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
+ }
+ fps = (dirty / dirty_buckets) * fp_term;
+ if (fragment > 3 && fps > proportional_scaled) {
+			/* Only override the proportional term when fragment > 3 */
+ proportional_scaled = fps;
+ }
+ }
+
if ((error < 0 && dc->writeback_rate_integral > 0) ||
(error > 0 && time_before64(local_clock(),
dc->writeback_rate.next + NSEC_PER_MSEC))) {
@@ -984,6 +1020,9 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
+ dc->writeback_rate_fp_term_low = 1;
+ dc->writeback_rate_fp_term_mid = 5;
+ dc->writeback_rate_fp_term_high = 50;
dc->writeback_rate_i_term_inverse = 10000;
WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 3f1230e22de0..02b2f9df73f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -16,6 +16,10 @@
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
+#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
+
#define BCH_DIRTY_INIT_THRD_MAX 64
/*
* 14 (16384ths) is chosen here as something that each backing device
--
2.17.1
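The three new terms are exposed through the cached device's sysfs
directory alongside the existing writeback_rate_* knobs. A quick sketch
of inspecting and tuning them (assuming the cached device shows up as
bcache0, as in the test script later in this thread):
---
# Read the per-stage terms added by this patch.
cat /sys/block/bcache0/bcache/writeback_rate_fp_term_low
cat /sys/block/bcache0/bcache/writeback_rate_fp_term_mid
cat /sys/block/bcache0/bcache/writeback_rate_fp_term_high
# Make the mid stage more aggressive than its default of 5 (as root).
echo 10 > /sys/block/bcache0/bcache/writeback_rate_fp_term_mid
---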
On 1/5/21 11:44 AM, Dongdong Tao wrote:
> Hey Coly,
>
> This is the second version of the patch, please allow me to explain a
> bit for this patch:
>
> We accelerate the rate in 3 stages with different aggressiveness, the
> first stage starts when dirty buckets percent reach above
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> tries to writeback the amount of dirty data in one bucket (on average)
> in (1 / (dirty_buckets_percent - 50)) second, the second stage tries to
> writeback the amount of dirty data in one bucket in (1 /
> (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> to writeback the amount of dirty data in one bucket in (1 /
> (dirty_buckets_percent - 64)) * 20 millisecond.
>
> As we can see, there are two writeback aggressiveness increasing
> strategies, one strategy is with the increasing of the stage, the first
> stage is the easy-going phase whose initial rate is trying to write back
> dirty data of one bucket in 1 second, the second stage is a bit more
> aggressive, the initial rate tries to writeback the dirty data of one
> bucket in 200 ms, the last stage is even more, whose initial rate tries
> to writeback the dirty data of one bucket in 20 ms. This makes sense,
> one reason is that if the preceding stage couldn’t get the fragmentation
> to a fine stage, then the next stage should increase the aggressiveness
> properly, also it is because the later stage is closer to the
> bch_cutoff_writeback_sync. Another aggressiveness increasing strategy is
> with the increasing of dirty bucket percent within each stage, the first
> strategy controls the initial writeback rate of each stage, while this
> one increases the rate based on the initial rate, which is initial_rate
> * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>
> The initial rate can be controlled by 3 parameters
> writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> writeback_rate_fp_term_high, they are default 1, 5, 50, users can adjust
> them based on their needs.
>
> The reason that I choose 50, 57, 64 as the threshold value is because
> the GC must be triggered at least once during each stage due to the
> “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache size. So,
> the hope is that the first and second stage can get us back to good
> shape in most situations by smoothly writing back the dirty data without
> giving too much stress to the backing devices, but it might still enter
> the third stage if the bucket consumption is very aggressive.
>
> This patch uses (dirty / dirty_buckets) * fp_term to calculate the rate,
> this formula means that we want to writeback (dirty / dirty_buckets) in
> 1/fp_term second, fp_term is calculated by above aggressiveness
> controller, “dirty” is the current dirty sectors, “dirty_buckets” is the
> current dirty buckets, so (dirty / dirty_buckets) means the average
> dirty sectors in one bucket, the value is between 0 to 1024 for the
> default setting, so this formula basically gives a hint that to reclaim
> one bucket in 1/fp_term second. By using this semantic, we can have a
> lower writeback rate when the amount of dirty data is decreasing and
> overcome the fact that dirty buckets number is always increasing unless
> GC happens.
>
> Compared to the first patch:
> The first patch tries to write back all the data in 40 seconds,
> this will result in a very high writeback rate when the amount of dirty
> data is big, this is mostly true for the large cache devices. The basic
> problem is that the semantic of this patch is not ideal, because we
> don’t really need to writeback all dirty data in order to solve this
> issue, and the instant large increase of the rate is something I feel we
> should better avoid (I like things to be smoothly changed unless no
> choice: )).
>
> Before I get to this new patch(which I believe should be optimal for me
> atm), there have been many tuning/testing iterations, eg. I’ve tried to
> tune the algorithm to writeback ⅓ of the dirty data in a certain amount
> of seconds, writeback 1/fragment of the dirty data in a certain amount
> of seconds, writeback all the dirty data only in those error_buckets
> (error buckets = dirty buckets - 50% of the total buckets) in a certain
> amount of time. However, those all turn out not to be ideal, only the
> semantic of the patch makes much sense for me and allows me to control
> the rate in a more precise way.
>
> Testing data:
> I'll provide the visualized testing data in the next couple of days
> with 1TB NVME devices cache but with HDD as backing device since it's
> what we mostly used in production env.
> I have the data for 400GB NVME, let me prepare it and take it for you to
> review.
[snipped]
Hi Dongdong,
Thanks for the update and continuous effort on this idea.
Please keep in mind that the writeback rate is just an advisory rate for
the writeback throughput; in real workloads, changing the writeback rate
number does not necessarily change the writeback throughput noticeably.
Currently I feel your patch is an interesting and promising idea, but I
am not able to say whether it will take effect in real workloads, so we
do need convincing performance data on real workloads and configurations.
Of course I may also help with the benchmark, but my to-do list is long
enough and it may introduce a very long delay.
Thanks.
Coly Li
On 1/7/21 10:55 PM, Dongdong Tao wrote:
> Hi Coly,
>
>
> Thanks for the reminder. I understand that the rate is only a hint of
> the throughput; it's a value used to calculate the sleep time between
> each round of keys writeback. The higher the rate, the shorter the sleep
> time, which most of the time means more dirty keys can be written back
> in a certain amount of time before the hard disk runs out of speed.
>
>
> Here is the testing data that run on a 400GB NVME + 1TB NVME HDD
>
Hi Dongdong,
Nice charts :-)
> Steps:
>
> 1.
>
> make-bcache -B <HDD> -C <NVME> --writeback
>
> 2.
>
> sudo fio --name=random-writers --filename=/dev/bcache0
> --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> > The fio benchmark commands ran for about 20 hours.
>
The time lengths of the first 3 charts are 7.000e+7, the rest are
1.60930e+9. I guess the time length of the I/O latency chart is 1/100 of
the rest. Can you also post the latency charts for 1.60930e+9 seconds?
Then I can compare the latency with the dirty data and available cache
charts.
Thanks.
Coly Li
>
> Let’s have a look at the write latency first:
>
> Master:
>
>
>
> Master+the patch:
>
> Combine them together:
>
> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> timestamp in milliseconds. As we can see, the master latency is
> obviously much higher than the one with my patch once the master bcache
> hits the cutoff writeback sync, and the master isn't going to get out of
> this situation: the graph shows it had already been stuck at the cutoff
> writeback sync for about 4 hours before I finished the testing, and it
> may still need to stay stuck for days before it can get out of this
> situation by itself.
>
>
> Note that there are 1 million points for each; red represents master,
> green represents master+my patch. Most of them overlap with each
> other, so it may look like this graph has more red points than green
> after it hits the cutoff, but that is simply because the latency has
> scaled to a bigger range, which represents the HDD latency.
>
>
>
> Let’s also have a look at the bcache’s cache available percent and dirty
> data percent.
>
> Master:
>
> Master+this patch:
>
> As you can see, this patch can avoid it hitting the cutoff writeback sync.
>
>
> As to say the improvement for this patch against the first one, let’s
> take a look at the writeback rate changing during the run.
>
> patch V1:
>
>
>
> Patch V2:
>
>
> The Y-axis is the value of the rate. V1 is very aggressive, as it jumps
> instantly from the minimum of 8 to around 10 million, while patch V2
> keeps the rate under 5000 during the run, and after the first round of
> writeback it can stay even under 2500. This proves we don't need to be
> as aggressive as V1 to get out of the high-fragmentation situation which
> eventually causes all writes to hit the backing device. This looks very
> reasonable to me now.
>
> Note that the fio command that I used consumes buckets quite
> aggressively, so it had to hit the third stage, which has the highest
> aggressiveness, but I believe this is not true in a real production env:
> a real production env won't consume buckets that aggressively, so I
> expect stage 3 will rarely need to be hit.
>
>
> As discussed, I'll run multiple block size testing on at least 1TB NVME
> device later.
> But it might take some time.
>
>
> Regards,
> Dongdong
>
> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <[email protected]
> <mailto:[email protected]>> wrote:
>
> On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > Hey Coly,
> >
> > This is the second version of the patch, please allow me to explain a
> > bit for this patch:
> >
> > We accelerate the rate in 3 stages with different aggressiveness, the
> > first stage starts when dirty buckets percent reach above
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> > tries to writeback the amount of dirty data in one bucket (on average)
> > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> tries to
> > writeback the amount of dirty data in one bucket in (1 /
> > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> > to writeback the amount of dirty data in one bucket in (1 /
> > (dirty_buckets_percent - 64)) * 20 millisecond.
> >
> > As we can see, there are two writeback aggressiveness increasing
> > strategies, one strategy is with the increasing of the stage, the
> first
> > stage is the easy-going phase whose initial rate is trying to
> write back
> > dirty data of one bucket in 1 second, the second stage is a bit more
> > aggressive, the initial rate tries to writeback the dirty data of one
> > bucket in 200 ms, the last stage is even more, whose initial rate
> tries
> > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> > one reason is that if the preceding stage couldn’t get the
> fragmentation
> > to a fine stage, then the next stage should increase the
> aggressiveness
> > properly, also it is because the later stage is closer to the
> > bch_cutoff_writeback_sync. Another aggressiveness increasing
> strategy is
> > with the increasing of dirty bucket percent within each stage, the
> first
> > strategy controls the initial writeback rate of each stage, while this
> > one increases the rate based on the initial rate, which is
> initial_rate
> > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> >
> > The initial rate can be controlled by 3 parameters
> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> adjust
> > them based on their needs.
> >
> > The reason that I choose 50, 57, 64 as the threshold value is because
> > the GC must be triggered at least once during each stage due to the
> > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> size. So,
> > the hope is that the first and second stage can get us back to good
> > shape in most situations by smoothly writing back the dirty data
> without
> > giving too much stress to the backing devices, but it might still
> enter
> > the third stage if the bucket consumption is very aggressive.
> >
> > This patch use (dirty / dirty_buckets) * fp_term to calculate the
> rate,
> > this formula means that we want to writeback (dirty /
> dirty_buckets) in
> > 1/fp_term second, fp_term is calculated by above aggressiveness
> > controller, “dirty” is the current dirty sectors, “dirty_buckets”
> is the
> > current dirty buckets, so (dirty / dirty_buckets) means the average
> > dirty sectors in one bucket, the value is between 0 to 1024 for the
> > default setting, so this formula basically gives a hint that to
> reclaim
> > one bucket in 1/fp_term second. By using this semantic, we can have a
> > lower writeback rate when the amount of dirty data is decreasing and
> > overcome the fact that dirty buckets number is always increasing
> unless
> > GC happens.
> >
> > *Compare to the first patch:
> > *The first patch is trying to write back all the data in 40 seconds,
> > this will result in a very high writeback rate when the amount of
> dirty
> > data is big, this is mostly true for the large cache devices. The
> basic
> > problem is that the semantic of this patch is not ideal, because we
> > don’t really need to writeback all dirty data in order to solve this
> > issue, and the instant large increase of the rate is something I
> feel we
> > should better avoid (I like things to be smoothly changed unless no
> > choice: )).
> >
> > Before I get to this new patch(which I believe should be optimal
> for me
> > atm), there have been many tuning/testing iterations, eg. I’ve
> tried to
> > tune the algorithm to writeback ⅓ of the dirty data in a certain
> amount
> > of seconds, writeback 1/fragment of the dirty data in a certain amount
> > of seconds, writeback all the dirty data only in those error_buckets
> > (error buckets = dirty buckets - 50% of the total buckets) in a
> certain
> > amount of time. However, those all turn out not to be ideal, only the
> > semantic of the patch makes much sense for me and allows me to control
> > the rate in a more precise way.
> >
> > *Testing data:
> > *I'll provide the visualized testing data in the next couple of days
> > with 1TB NVME devices cache but with HDD as backing device since it's
> > what we mostly used in production env.
> > I have the data for 400GB NVME, let me prepare it and take it for
> you to
> > review.
> [snipped]
>
> Hi Dongdong,
>
> Thanks for the update and continuous effort on this idea.
>
> Please keep in mind the writeback rate is just a advice rate for the
> writeback throughput, in real workload changing the writeback rate
> number does not change writeback throughput obviously.
>
> Currently I feel this is an interesting and promising idea for your
> patch, but I am not able to say whether it may take effect in real
> workload, so we do need convinced performance data on real workload and
> configuration.
>
> Of course I may also help on the benchmark, but my to-do list is long
> enough and it may take a very long delay time.
>
> Thanks.
>
> Coly Li
>
Hi Coly,
They are captured over the same time length, but the meaning of the
timestamp and the time unit on the x-axis are different.
(Sorry, I should have clarified this right after the charts.)
For the latency chart:
The timestamp is the relative time since the beginning of the
benchmark, so the start timestamp is 0 and the unit is milliseconds.
For the dirty data and cache available percent chart:
The timestamp is the UNIX timestamp and the time unit is seconds.
I capture the stats every 5 seconds with the script below:
---
#!/bin/sh
while true; do
	dirty=`cat /sys/block/bcache0/bcache/dirty_data`
	avail=`cat /sys/block/bcache0/bcache/cache/cache_available_percent`
	rate=`cat /sys/block/bcache0/bcache/writeback_rate`
	echo "`date +%s`, $dirty, $avail, $rate" >> $1
	sleep 5
done
---
Unfortunately, I can't easily make them use the same timestamp, but I
can try to convert the UNIX timestamps to relative time like the first
chart; one possible conversion is sketched below. Even if we ignore the
values on the X-axis, we can still roughly compare the charts by the
length of the X-axis since they cover the same time span, and we can see
that the master's writes start hitting the backing device when
cache_available_percent drops to around 30.
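For instance, a minimal conversion sketch (assuming the stats were
appended to a file named stats.log, a hypothetical name; the capture
script above writes to whatever path is passed as $1):
---
#!/bin/sh
# Rebase the first column (epoch seconds) to seconds since the first
# sample, so the chart can use relative time like the fio latency log.
awk 'BEGIN {FS=OFS=", "} NR==1 {start=$1} {$1=$1-start; print}' \
	stats.log > stats_relative.log
---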
Regards,
Dongdong
On Fri, Jan 8, 2021 at 12:06 PM Coly Li <[email protected]> wrote:
>
> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> >
> > Thanks for the reminder, I understand that the rate is only a hint of
> > the throughput, it’s a value to calculate the sleep time between each
> > round of keys writeback, the higher the rate, the shorter the sleep
> > time, most of the time this means the more dirty keys it can writeback
> > in a certain amount of time before the hard disk running out of speed.
> >
> >
> > Here is the testing data that run on a 400GB NVME + 1TB NVME HDD
> >
>
> Hi Dongdong,
>
> Nice charts :-)
>
> > Steps:
> >
> > 1.
> >
> > make-bcache -B <HDD> -C <NVME> --writeback
> >
> > 2.
> >
> > sudo fio --name=random-writers --filename=/dev/bcache0
> > --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> > --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> > > The fio benchmark commands ran for about 20 hours.
> >
>
> The time lengths of first 3 charts are 7.000e+7, rested are 1.60930e+9.
> I guess the time length of the I/O latency chart is 1/100 of the rested.
>
> Can you also post the latency charts for 1.60930e+9 seconds? Then I can
> compare the latency with dirty data and available cache charts.
>
>
> Thanks.
>
>
> Coly Li
>
>
>
>
>
> >
> > Let’s have a look at the write latency first:
> >
> > Master:
> >
> >
> >
> > Master+the patch:
> >
> > Combine them together:
> >
> > Again, the latency (y-axis) is based on nano-second, x-axis is the
> > timestamp based on milli-second, as we can see the master latency is
> > obviously much higher than the one with my patch when the master bcache
> > hit the cutoff writeback sync, the master isn’t going to get out of this
> > cutoff writeback sync situation, This graph showed it already stuck at
> > the cutoff writeback sync for about 4 hours before I finish the testing,
> > it may still needs to stuck for days before it can get out this
> > situation itself.
> >
> >
> > Note that there are 1 million points for each , red represents master,
> > green represents mater+my patch. Most of them are overlapped with each
> > other, so it may look like this graph has more red points then green
> > after it hitting the cutoff, but simply it’s because the latency has
> > scaled to a bigger range which represents the HDD latency.
> >
> >
> >
> > Let’s also have a look at the bcache’s cache available percent and dirty
> > data percent.
> >
> > Master:
> >
> > Master+this patch:
> >
> > As you can see, this patch can avoid it hitting the cutoff writeback sync.
> >
> >
> > As to say the improvement for this patch against the first one, let’s
> > take a look at the writeback rate changing during the run.
> >
> > patch V1:
> >
> >
> >
> > Patch V2:
> >
> >
> > The Y-axis is the value of rate, the V1 is very aggressive as it jumps
> > instantly from a minimum 8 to around 10 million. And the patch V2 can
> > control the rate under 5000 during the run, and after the first round of
> > writeback, it can stay even under 2500, so this proves we don’t need to
> > be as aggressive as V1 to get out of the high fragment situation which
> > eventually causes all writes hitting the backing device. This looks very
> > reasonable for me now.
> >
> > Note that the fio command that I used is consuming the bucket quite
> > aggressively, so it had to hit the third stage which has the highest
> > aggressiveness, but I believe this is not true in a real production env,
> > real production env won’t consume buckets that aggressively, so I expect
> > stage 3 may not very often be needed to hit.
> >
> >
> > As discussed, I'll run multiple block size testing on at least 1TB NVME
> > device later.
> > But it might take some time.
> >
> >
> > Regards,
> > Dongdong
> >
> > On Tue, Jan 5, 2021 at 12:33 PM Coly Li <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> > On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > > Hey Coly,
> > >
> > > This is the second version of the patch, please allow me to explain a
> > > bit for this patch:
> > >
> > > We accelerate the rate in 3 stages with different aggressiveness, the
> > > first stage starts when dirty buckets percent reach above
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> > > tries to writeback the amount of dirty data in one bucket (on average)
> > > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> > tries to
> > > writeback the amount of dirty data in one bucket in (1 /
> > > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> > > to writeback the amount of dirty data in one bucket in (1 /
> > > (dirty_buckets_percent - 64)) * 20 millisecond.
> > >
> > > As we can see, there are two writeback aggressiveness increasing
> > > strategies, one strategy is with the increasing of the stage, the
> > first
> > > stage is the easy-going phase whose initial rate is trying to
> > write back
> > > dirty data of one bucket in 1 second, the second stage is a bit more
> > > aggressive, the initial rate tries to writeback the dirty data of one
> > > bucket in 200 ms, the last stage is even more, whose initial rate
> > tries
> > > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> > > one reason is that if the preceding stage couldn’t get the
> > fragmentation
> > > to a fine stage, then the next stage should increase the
> > aggressiveness
> > > properly, also it is because the later stage is closer to the
> > > bch_cutoff_writeback_sync. Another aggressiveness increasing
> > strategy is
> > > with the increasing of dirty bucket percent within each stage, the
> > first
> > > strategy controls the initial writeback rate of each stage, while this
> > > one increases the rate based on the initial rate, which is
> > initial_rate
> > > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> > >
> > > The initial rate can be controlled by 3 parameters
> > > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> > > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> > adjust
> > > them based on their needs.
> > >
> > > The reason that I choose 50, 57, 64 as the threshold value is because
> > > the GC must be triggered at least once during each stage due to the
> > > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> > size. So,
> > > the hope is that the first and second stage can get us back to good
> > > shape in most situations by smoothly writing back the dirty data
> > without
> > > giving too much stress to the backing devices, but it might still
> > enter
> > > the third stage if the bucket consumption is very aggressive.
> > >
> > > This patch use (dirty / dirty_buckets) * fp_term to calculate the
> > rate,
> > > this formula means that we want to writeback (dirty /
> > dirty_buckets) in
> > > 1/fp_term second, fp_term is calculated by above aggressiveness
> > > controller, “dirty” is the current dirty sectors, “dirty_buckets”
> > is the
> > > current dirty buckets, so (dirty / dirty_buckets) means the average
> > > dirty sectors in one bucket, the value is between 0 to 1024 for the
> > > default setting, so this formula basically gives a hint that to
> > reclaim
> > > one bucket in 1/fp_term second. By using this semantic, we can have a
> > > lower writeback rate when the amount of dirty data is decreasing and
> > > overcome the fact that dirty buckets number is always increasing
> > unless
> > > GC happens.
> > >
> > > *Compare to the first patch:
> > > *The first patch is trying to write back all the data in 40 seconds,
> > > this will result in a very high writeback rate when the amount of
> > dirty
> > > data is big, this is mostly true for the large cache devices. The
> > basic
> > > problem is that the semantic of this patch is not ideal, because we
> > > don’t really need to writeback all dirty data in order to solve this
> > > issue, and the instant large increase of the rate is something I
> > feel we
> > > should better avoid (I like things to be smoothly changed unless no
> > > choice: )).
> > >
> > > Before I get to this new patch(which I believe should be optimal
> > for me
> > > atm), there have been many tuning/testing iterations, eg. I’ve
> > tried to
> > > tune the algorithm to writeback ⅓ of the dirty data in a certain
> > amount
> > > of seconds, writeback 1/fragment of the dirty data in a certain amount
> > > of seconds, writeback all the dirty data only in those error_buckets
> > > (error buckets = dirty buckets - 50% of the total buckets) in a
> > certain
> > > amount of time. However, those all turn out not to be ideal, only the
> > > semantic of the patch makes much sense for me and allows me to control
> > > the rate in a more precise way.
> > >
> > > *Testing data:
> > > *I'll provide the visualized testing data in the next couple of days
> > > with 1TB NVME devices cache but with HDD as backing device since it's
> > > what we mostly used in production env.
> > > I have the data for 400GB NVME, let me prepare it and take it for
> > you to
> > > review.
> > [snipped]
> >
> > Hi Dongdong,
> >
> > Thanks for the update and continuous effort on this idea.
> >
> > Please keep in mind the writeback rate is just a advice rate for the
> > writeback throughput, in real workload changing the writeback rate
> > number does not change writeback throughput obviously.
> >
> > Currently I feel this is an interesting and promising idea for your
> > patch, but I am not able to say whether it may take effect in real
> > workload, so we do need convinced performance data on real workload and
> > configuration.
> >
> > Of course I may also help on the benchmark, but my to-do list is long
> > enough and it may take a very long delay time.
> >
> > Thanks.
> >
> > Coly Li
> >
>
On 1/8/21 4:30 PM, Dongdong Tao wrote:
> Hi Coly,
>
> They are captured with the same time length, the meaning of the
> timestamp and the time unit on the x-axis are different.
> (Sorry, I should have clarified this right after the chart)
>
> For the latency chart:
> The timestamp is the relative time since the beginning of the
> benchmark, so the start timestamp is 0 and the unit is based on
> millisecond
>
> For the dirty data and cache available percent chart:
> The timestamp is the UNIX timestamp, the time unit is based on second,
> I capture the stats every 5 seconds with the below script:
> ---
> #!/bin/sh
> while true; do echo "`date +%s`, `cat
> /sys/block/bcache0/bcache/dirty_data`, `cat
> /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat
> /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done;
> ---
>
> Unfortunately, I can't easily make them using the same timestamp, but
> I guess I can try to convert the UNIX timestamp to the relative time
> like the first one.
> But If we ignore the value of the X-axis, we can still roughly
> compare them by using the length of the X-axis since they have the
> same time length,
> and we can see that the Master's write start hitting the backing
> device when the cache_available_percent dropped to around 30.
Copied, thanks for the explanation. The chart for a single thread with
io depth 1 is convincing IMHO :-)
One more question: the benchmark uses a single I/O thread with io depth
1, which is not a typical condition for real workloads. Do you have a
plan to test the latency and IOPS for multiple threads with a larger I/O
depth?
Thanks.
Coly Li
>
> On Fri, Jan 8, 2021 at 12:06 PM Coly Li <[email protected]> wrote:
>>
>> On 1/7/21 10:55 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>>
>>> Thanks for the reminder, I understand that the rate is only a hint of
>>> the throughput, it’s a value to calculate the sleep time between each
>>> round of keys writeback, the higher the rate, the shorter the sleep
>>> time, most of the time this means the more dirty keys it can writeback
>>> in a certain amount of time before the hard disk running out of speed.
>>>
>>>
>>> Here is the testing data that run on a 400GB NVME + 1TB NVME HDD
>>>
>>
>> Hi Dongdong,
>>
>> Nice charts :-)
>>
>>> Steps:
>>>
>>> 1.
>>>
>>> make-bcache -B <HDD> -C <NVME> --writeback
>>>
>>> 2.
>>>
>>> sudo fio --name=random-writers --filename=/dev/bcache0
>>> --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
>>> --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
>>>> The fio benchmark commands ran for about 20 hours.
>>>
>>
>> The time lengths of first 3 charts are 7.000e+7, rested are 1.60930e+9.
>> I guess the time length of the I/O latency chart is 1/100 of the rested.
>>
>> Can you also post the latency charts for 1.60930e+9 seconds? Then I can
>> compare the latency with dirty data and available cache charts.
>>
>>
>> Thanks.
>>
>>
>> Coly Li
>>
>>
>>
>>
>>
>>>
>>> Let’s have a look at the write latency first:
>>>
>>> Master:
>>>
>>>
>>>
>>> Master+the patch:
>>>
>>> Combine them together:
>>>
>>> Again, the latency (y-axis) is based on nano-second, x-axis is the
>>> timestamp based on milli-second, as we can see the master latency is
>>> obviously much higher than the one with my patch when the master bcache
>>> hit the cutoff writeback sync, the master isn’t going to get out of this
>>> cutoff writeback sync situation, This graph showed it already stuck at
>>> the cutoff writeback sync for about 4 hours before I finish the testing,
>>> it may still needs to stuck for days before it can get out this
>>> situation itself.
>>>
>>>
>>> Note that there are 1 million points for each , red represents master,
>>> green represents mater+my patch. Most of them are overlapped with each
>>> other, so it may look like this graph has more red points then green
>>> after it hitting the cutoff, but simply it’s because the latency has
>>> scaled to a bigger range which represents the HDD latency.
>>>
>>>
>>>
>>> Let’s also have a look at the bcache’s cache available percent and dirty
>>> data percent.
>>>
>>> Master:
>>>
>>> Master+this patch:
>>>
>>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
>>>
>>>
>>> As to say the improvement for this patch against the first one, let’s
>>> take a look at the writeback rate changing during the run.
>>>
>>> patch V1:
>>>
>>>
>>>
>>> Patch V2:
>>>
>>>
>>> The Y-axis is the value of rate, the V1 is very aggressive as it jumps
>>> instantly from a minimum 8 to around 10 million. And the patch V2 can
>>> control the rate under 5000 during the run, and after the first round of
>>> writeback, it can stay even under 2500, so this proves we don’t need to
>>> be as aggressive as V1 to get out of the high fragment situation which
>>> eventually causes all writes hitting the backing device. This looks very
>>> reasonable for me now.
>>>
>>> Note that the fio command that I used is consuming the bucket quite
>>> aggressively, so it had to hit the third stage which has the highest
>>> aggressiveness, but I believe this is not true in a real production env,
>>> real production env won’t consume buckets that aggressively, so I expect
>>> stage 3 may not very often be needed to hit.
>>>
>>>
>>> As discussed, I'll run multiple block size testing on at least 1TB NVME
>>> device later.
>>> But it might take some time.
>>>
>>>
>>> Regards,
>>> Dongdong
>>>
>>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> On 1/5/21 11:44 AM, Dongdong Tao wrote:
>>> > Hey Coly,
>>> >
>>> > This is the second version of the patch, please allow me to explain a
>>> > bit for this patch:
>>> >
>>> > We accelerate the rate in 3 stages with different aggressiveness, the
>>> > first stage starts when dirty buckets percent reach above
>>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
>>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
>>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
>>> > tries to writeback the amount of dirty data in one bucket (on average)
>>> > in (1 / (dirty_buckets_percent - 50)) second, the second stage
>>> tries to
>>> > writeback the amount of dirty data in one bucket in (1 /
>>> > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
>>> > to writeback the amount of dirty data in one bucket in (1 /
>>> > (dirty_buckets_percent - 64)) * 20 millisecond.
>>> >
>>> > As we can see, there are two writeback aggressiveness increasing
>>> > strategies, one strategy is with the increasing of the stage, the
>>> first
>>> > stage is the easy-going phase whose initial rate is trying to
>>> write back
>>> > dirty data of one bucket in 1 second, the second stage is a bit more
>>> > aggressive, the initial rate tries to writeback the dirty data of one
>>> > bucket in 200 ms, the last stage is even more, whose initial rate
>>> tries
>>> > to writeback the dirty data of one bucket in 20 ms. This makes sense,
>>> > one reason is that if the preceding stage couldn’t get the
>>> fragmentation
>>> > to a fine stage, then the next stage should increase the
>>> aggressiveness
>>> > properly, also it is because the later stage is closer to the
>>> > bch_cutoff_writeback_sync. Another aggressiveness increasing
>>> strategy is
>>> > with the increasing of dirty bucket percent within each stage, the
>>> first
>>> > strategy controls the initial writeback rate of each stage, while this
>>> > one increases the rate based on the initial rate, which is
>>> initial_rate
>>> > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
>>> >
>>> > The initial rate can be controlled by 3 parameters
>>> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
>>> > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
>>> adjust
>>> > them based on their needs.
>>> >
>>> > The reason that I choose 50, 57, 64 as the threshold value is because
>>> > the GC must be triggered at least once during each stage due to the
>>> > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
>>> size. So,
>>> > the hope is that the first and second stage can get us back to good
>>> > shape in most situations by smoothly writing back the dirty data
>>> without
>>> > giving too much stress to the backing devices, but it might still
>>> enter
>>> > the third stage if the bucket consumption is very aggressive.
>>> >
>>> > This patch use (dirty / dirty_buckets) * fp_term to calculate the
>>> rate,
>>> > this formula means that we want to writeback (dirty /
>>> dirty_buckets) in
>>> > 1/fp_term second, fp_term is calculated by above aggressiveness
>>> > controller, “dirty” is the current dirty sectors, “dirty_buckets”
>>> is the
>>> > current dirty buckets, so (dirty / dirty_buckets) means the average
>>> > dirty sectors in one bucket, the value is between 0 to 1024 for the
>>> > default setting, so this formula basically gives a hint that to
>>> reclaim
>>> > one bucket in 1/fp_term second. By using this semantic, we can have a
>>> > lower writeback rate when the amount of dirty data is decreasing and
>>> > overcome the fact that dirty buckets number is always increasing
>>> unless
>>> > GC happens.
>>> >
>>> > *Compare to the first patch:
>>> > *The first patch is trying to write back all the data in 40 seconds,
>>> > this will result in a very high writeback rate when the amount of
>>> dirty
>>> > data is big, this is mostly true for the large cache devices. The
>>> basic
>>> > problem is that the semantic of this patch is not ideal, because we
>>> > don’t really need to writeback all dirty data in order to solve this
>>> > issue, and the instant large increase of the rate is something I
>>> feel we
>>> > should better avoid (I like things to be smoothly changed unless no
>>> > choice: )).
>>> >
>>> > Before I get to this new patch(which I believe should be optimal
>>> for me
>>> > atm), there have been many tuning/testing iterations, eg. I’ve
>>> tried to
>>> > tune the algorithm to writeback ⅓ of the dirty data in a certain
>>> amount
>>> > of seconds, writeback 1/fragment of the dirty data in a certain amount
>>> > of seconds, writeback all the dirty data only in those error_buckets
>>> > (error buckets = dirty buckets - 50% of the total buckets) in a
>>> certain
>>> > amount of time. However, those all turn out not to be ideal, only the
>>> > semantic of the patch makes much sense for me and allows me to control
>>> > the rate in a more precise way.
>>> >
>>> > *Testing data:
>>> > *I'll provide the visualized testing data in the next couple of days
>>> > with 1TB NVME devices cache but with HDD as backing device since it's
>>> > what we mostly used in production env.
>>> > I have the data for 400GB NVME, let me prepare it and take it for
>>> you to
>>> > review.
>>> [snipped]
>>>
>>> Hi Dongdong,
>>>
>>> Thanks for the update and continuous effort on this idea.
>>>
>>> Please keep in mind the writeback rate is just a advice rate for the
>>> writeback throughput, in real workload changing the writeback rate
>>> number does not change writeback throughput obviously.
>>>
>>> Currently I feel this is an interesting and promising idea for your
>>> patch, but I am not able to say whether it may take effect in real
>>> workload, so we do need convinced performance data on real workload and
>>> configuration.
>>>
>>> Of course I may also help on the benchmark, but my to-do list is long
>>> enough and it may take a very long delay time.
>>>
>>> Thanks.
>>>
>>> Coly Li
>>>
>>
Yep, I will scale the testing to multiple threads with a larger IO
depth, thanks for the suggestion! A possible invocation is sketched
below.
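A scaled-up variant of the earlier fio command might look something like
this (the thread count and queue depth here are illustrative, not
necessarily the values that will be used):
---
sudo fio --name=random-writers --filename=/dev/bcache0 \
	--ioengine=libaio --iodepth=32 --rw=randrw --blocksize=64k,8k \
	--direct=1 --numjobs=8 --write_lat_log=mix --log_avg_msec=10
---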
On Fri, Jan 8, 2021 at 4:40 PM Coly Li <[email protected]> wrote:
>
> On 1/8/21 4:30 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > They are captured with the same time length, the meaning of the
> > timestamp and the time unit on the x-axis are different.
> > (Sorry, I should have clarified this right after the chart)
> >
> > For the latency chart:
> > The timestamp is the relative time since the beginning of the
> > benchmark, so the start timestamp is 0 and the unit is based on
> > millisecond
> >
> > For the dirty data and cache available percent chart:
> > The timestamp is the UNIX timestamp, the time unit is based on second,
> > I capture the stats every 5 seconds with the below script:
> > ---
> > #!/bin/sh
> > while true; do echo "`date +%s`, `cat
> > /sys/block/bcache0/bcache/dirty_data`, `cat
> > /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat
> > /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done;
> > ---
> >
> > Unfortunately, I can't easily make them using the same timestamp, but
> > I guess I can try to convert the UNIX timestamp to the relative time
> > like the first one.
> > But If we ignore the value of the X-axis, we can still roughly
> > compare them by using the length of the X-axis since they have the
> > same time length,
> > and we can see that the Master's write start hitting the backing
> > device when the cache_available_percent dropped to around 30.
>
> Copied, thanks for the explanation. The chart for single thread with io
> depth 1 is convinced IMHO :-)
>
> One more question, the benchmark is about a single I/O thread with io
> depth 1, which is not typical condition for real workload. Do you have
> plan to test the latency and IOPS for multiple threads with larger I/O
> depth ?
>
>
> Thanks.
>
>
> Coly Li
>
>
> >
> > On Fri, Jan 8, 2021 at 12:06 PM Coly Li <[email protected]> wrote:
> >>
> >> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> >>> Hi Coly,
> >>>
> >>>
> >>> Thanks for the reminder, I understand that the rate is only a hint of
> >>> the throughput, it’s a value to calculate the sleep time between each
> >>> round of keys writeback, the higher the rate, the shorter the sleep
> >>> time, most of the time this means the more dirty keys it can writeback
> >>> in a certain amount of time before the hard disk running out of speed.
> >>>
> >>>
> >>> Here is the testing data that run on a 400GB NVME + 1TB NVME HDD
> >>>
> >>
> >> Hi Dongdong,
> >>
> >> Nice charts :-)
> >>
> >>> Steps:
> >>>
> >>> 1.
> >>>
> >>> make-bcache -B <HDD> -C <NVME> --writeback
> >>>
> >>> 2.
> >>>
> >>> sudo fio --name=random-writers --filename=/dev/bcache0
> >>> --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> >>> --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> >>>> The fio benchmark commands ran for about 20 hours.
> >>>
> >>
> >> The time lengths of first 3 charts are 7.000e+7, rested are 1.60930e+9.
> >> I guess the time length of the I/O latency chart is 1/100 of the rested.
> >>
> >> Can you also post the latency charts for 1.60930e+9 seconds? Then I can
> >> compare the latency with dirty data and available cache charts.
> >>
> >>
> >> Thanks.
> >>
> >>
> >> Coly Li
> >>
> >>
> >>
> >>
> >>
> >>>
> >>> Let’s have a look at the write latency first:
> >>>
> >>> Master:
> >>>
> >>>
> >>>
> >>> Master+the patch:
> >>>
> >>> Combine them together:
> >>>
> >>> Again, the latency (y-axis) is based on nano-second, x-axis is the
> >>> timestamp based on milli-second, as we can see the master latency is
> >>> obviously much higher than the one with my patch when the master bcache
> >>> hit the cutoff writeback sync, the master isn’t going to get out of this
> >>> cutoff writeback sync situation, This graph showed it already stuck at
> >>> the cutoff writeback sync for about 4 hours before I finish the testing,
> >>> it may still needs to stuck for days before it can get out this
> >>> situation itself.
> >>>
> >>>
> >>> Note that there are 1 million points for each , red represents master,
> >>> green represents mater+my patch. Most of them are overlapped with each
> >>> other, so it may look like this graph has more red points then green
> >>> after it hitting the cutoff, but simply it’s because the latency has
> >>> scaled to a bigger range which represents the HDD latency.
> >>>
> >>>
> >>>
> >>> Let’s also have a look at the bcache’s cache available percent and dirty
> >>> data percent.
> >>>
> >>> Master:
> >>>
> >>> Master+this patch:
> >>>
> >>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
> >>>
> >>>
> >>> As to say the improvement for this patch against the first one, let’s
> >>> take a look at the writeback rate changing during the run.
> >>>
> >>> patch V1:
> >>>
> >>>
> >>>
> >>> Patch V2:
> >>>
> >>>
> >>> The Y-axis is the value of rate, the V1 is very aggressive as it jumps
> >>> instantly from a minimum 8 to around 10 million. And the patch V2 can
> >>> control the rate under 5000 during the run, and after the first round of
> >>> writeback, it can stay even under 2500, so this proves we don’t need to
> >>> be as aggressive as V1 to get out of the high fragment situation which
> >>> eventually causes all writes hitting the backing device. This looks very
> >>> reasonable for me now.
> >>>
> >>> Note that the fio command that I used is consuming the bucket quite
> >>> aggressively, so it had to hit the third stage which has the highest
> >>> aggressiveness, but I believe this is not true in a real production env,
> >>> real production env won’t consume buckets that aggressively, so I expect
> >>> stage 3 may not very often be needed to hit.
> >>>
> >>>
> >>> As discussed, I'll run multiple block size testing on at least 1TB NVME
> >>> device later.
> >>> But it might take some time.
> >>>
> >>>
> >>> Regards,
> >>> Dongdong
> >>>
> >>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <[email protected]
> >>> <mailto:[email protected]>> wrote:
> >>>
> >>> On 1/5/21 11:44 AM, Dongdong Tao wrote:
> >>> > Hey Coly,
> >>> >
> >>> > This is the second version of the patch, please allow me to explain a
> >>> > bit for this patch:
> >>> >
> >>> > We accelerate the rate in 3 stages with different aggressiveness, the
> >>> > first stage starts when dirty buckets percent reach above
> >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> >>> > tries to writeback the amount of dirty data in one bucket (on average)
> >>> > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> >>> tries to
> >>> > writeback the amount of dirty data in one bucket in (1 /
> >>> > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> >>> > to writeback the amount of dirty data in one bucket in (1 /
> >>> > (dirty_buckets_percent - 64)) * 20 millisecond.
> >>> >
> >>> > As we can see, there are two writeback aggressiveness increasing
> >>> > strategies, one strategy is with the increasing of the stage, the
> >>> first
> >>> > stage is the easy-going phase whose initial rate is trying to
> >>> write back
> >>> > dirty data of one bucket in 1 second, the second stage is a bit more
> >>> > aggressive, the initial rate tries to writeback the dirty data of one
> >>> > bucket in 200 ms, the last stage is even more, whose initial rate
> >>> tries
> >>> > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> >>> > one reason is that if the preceding stage couldn’t get the
> >>> fragmentation
> >>> > to a fine stage, then the next stage should increase the
> >>> aggressiveness
> >>> > properly, also it is because the later stage is closer to the
> >>> > bch_cutoff_writeback_sync. Another aggressiveness increasing
> >>> strategy is
> >>> > with the increasing of dirty bucket percent within each stage, the
> >>> first
> >>> > strategy controls the initial writeback rate of each stage, while this
> >>> > one increases the rate based on the initial rate, which is
> >>> initial_rate
> >>> > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> >>> >
> >>> > The initial rate can be controlled by 3 parameters
> >>> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> >>> > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> >>> adjust
> >>> > them based on their needs.
> >>> >
> >>> > The reason that I choose 50, 57, 64 as the threshold value is because
> >>> > the GC must be triggered at least once during each stage due to the
> >>> > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> >>> size. So,
> >>> > the hope is that the first and second stage can get us back to good
> >>> > shape in most situations by smoothly writing back the dirty data
> >>> without
> >>> > giving too much stress to the backing devices, but it might still
> >>> enter
> >>> > the third stage if the bucket consumption is very aggressive.
> >>> >
> >>> > This patch use (dirty / dirty_buckets) * fp_term to calculate the
> >>> rate,
> >>> > this formula means that we want to writeback (dirty /
> >>> dirty_buckets) in
> >>> > 1/fp_term second, fp_term is calculated by above aggressiveness
> >>> > controller, “dirty” is the current dirty sectors, “dirty_buckets”
> >>> is the
> >>> > current dirty buckets, so (dirty / dirty_buckets) means the average
> >>> > dirty sectors in one bucket, the value is between 0 to 1024 for the
> >>> > default setting, so this formula basically gives a hint that to
> >>> reclaim
> >>> > one bucket in 1/fp_term second. By using this semantic, we can have a
> >>> > lower writeback rate when the amount of dirty data is decreasing and
> >>> > overcome the fact that dirty buckets number is always increasing
> >>> unless
> >>> > GC happens.
> >>> >
> >>> > *Compare to the first patch:
> >>> > *The first patch is trying to write back all the data in 40 seconds,
> >>> > this will result in a very high writeback rate when the amount of
> >>> dirty
> >>> > data is big, this is mostly true for the large cache devices. The
> >>> basic
> >>> > problem is that the semantic of this patch is not ideal, because we
> >>> > don’t really need to writeback all dirty data in order to solve this
> >>> > issue, and the instant large increase of the rate is something I
> >>> feel we
> >>> > should better avoid (I like things to be smoothly changed unless no
> >>> > choice: )).
> >>> >
> >>> > Before I get to this new patch(which I believe should be optimal
> >>> for me
> >>> > atm), there have been many tuning/testing iterations, eg. I’ve
> >>> tried to
> >>> > tune the algorithm to writeback ⅓ of the dirty data in a certain
> >>> amount
> >>> > of seconds, writeback 1/fragment of the dirty data in a certain amount
> >>> > of seconds, writeback all the dirty data only in those error_buckets
> >>> > (error buckets = dirty buckets - 50% of the total buckets) in a
> >>> certain
> >>> > amount of time. However, those all turn out not to be ideal, only the
> >>> > semantic of the patch makes much sense for me and allows me to control
> >>> > the rate in a more precise way.
> >>> >
> >>> > *Testing data:
> >>> > *I'll provide the visualized testing data in the next couple of days
> >>> > with 1TB NVME devices cache but with HDD as backing device since it's
> >>> > what we mostly used in production env.
> >>> > I have the data for 400GB NVME, let me prepare it and take it for
> >>> you to
> >>> > review.
> >>> [snipped]
> >>>
> >>> Hi Dongdong,
> >>>
> >>> Thanks for the update and continuous effort on this idea.
> >>>
> >>> Please keep in mind the writeback rate is just a advice rate for the
> >>> writeback throughput, in real workload changing the writeback rate
> >>> number does not change writeback throughput obviously.
> >>>
> >>> Currently I feel this is an interesting and promising idea for your
> >>> patch, but I am not able to say whether it may take effect in real
> >>> workload, so we do need convinced performance data on real workload and
> >>> configuration.
> >>>
> >>> Of course I may also help on the benchmark, but my to-do list is long
> >>> enough and it may take a very long delay time.
> >>>
> >>> Thanks.
> >>>
> >>> Coly Li
> >>>
> >>
>
[Sharing the google doc here to avoid SPAM detection]
Here is the new testing result from the multi-threaded fio testing:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
On Fri, Jan 8, 2021 at 4:47 PM Dongdong Tao <[email protected]> wrote:
>
> Yeap, I will scale the testing for multiple threads with larger IO
> depth, thanks for the suggestion!
>
> On Fri, Jan 8, 2021 at 4:40 PM Coly Li <[email protected]> wrote:
> >
> > On 1/8/21 4:30 PM, Dongdong Tao wrote:
> > > Hi Coly,
> > >
> > > They are captured with the same time length, the meaning of the
> > > timestamp and the time unit on the x-axis are different.
> > > (Sorry, I should have clarified this right after the chart)
> > >
> > > For the latency chart:
> > > The timestamp is the relative time since the beginning of the
> > > benchmark, so the start timestamp is 0 and the unit is milliseconds.
> > >
> > > For the dirty data and cache available percent chart:
> > > The timestamp is the UNIX timestamp, the time unit is seconds, and
> > > I capture the stats every 5 seconds with the below script:
> > > ---
> > > #!/bin/sh
> > > while true; do echo "`date +%s`, `cat
> > > /sys/block/bcache0/bcache/dirty_data`, `cat
> > > /sys/block/bcache0/bcache/cache/cache_available_percent`, `cat
> > > /sys/block/bcache0/bcache/writeback_rate`" >> $1; sleep 5; done;
> > > ---
> > >
> > > Unfortunately, I can't easily make them use the same timestamp, but
> > > I guess I can try to convert the UNIX timestamp to relative time
> > > like the first one.
> > > But if we ignore the value of the X-axis, we can still roughly
> > > compare them by using the length of the X-axis, since they have the
> > > same time length,
> > > and we can see that the Master's writes start hitting the backing
> > > device when the cache_available_percent dropped to around 30.
> >
> > Copied, thanks for the explanation. The chart for a single thread with
> > io depth 1 is convincing IMHO :-)
> >
> > One more question: the benchmark is about a single I/O thread with io
> > depth 1, which is not a typical condition for a real workload. Do you
> > have a plan to test the latency and IOPS for multiple threads with a
> > larger I/O depth?
> >
> >
> > Thanks.
> >
> >
> > Coly Li
> >
> >
> > >
> > > On Fri, Jan 8, 2021 at 12:06 PM Coly Li <[email protected]> wrote:
> > >>
> > >> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> > >>> Hi Coly,
> > >>>
> > >>>
> > >>> Thanks for the reminder. I understand that the rate is only a hint of
> > >>> the throughput; it’s a value used to calculate the sleep time between
> > >>> each round of keys writeback. The higher the rate, the shorter the
> > >>> sleep time, and most of the time this means more dirty keys can be
> > >>> written back in a certain amount of time before the hard disk runs
> > >>> out of speed.
> > >>>
> > >>>
> > >>> Here is the testing data that run on a 400GB NVME + 1TB NVME HDD
> > >>>
> > >>
> > >> Hi Dongdong,
> > >>
> > >> Nice charts :-)
> > >>
> > >>> Steps:
> > >>>
> > >>> 1.
> > >>>
> > >>> make-bcache -B <HDD> -C <NVME> --writeback
> > >>>
> > >>> 2.
> > >>>
> > >>> sudo fio --name=random-writers --filename=/dev/bcache0
> > >>> --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> > >>> --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> > >>>> The fio benchmark commands ran for about 20 hours.
> > >>>
> > >>
> > >> The time lengths of the first 3 charts are 7.000e+7, the rest are
> > >> 1.60930e+9. I guess the time length of the I/O latency chart is
> > >> 1/100 of the rest.
> > >>
> > >> Can you also post the latency charts for 1.60930e+9 seconds? Then I can
> > >> compare the latency with dirty data and available cache charts.
> > >>
> > >>
> > >> Thanks.
> > >>
> > >>
> > >> Coly Li
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>>
> > >>> Let’s have a look at the write latency first:
> > >>>
> > >>> Master:
> > >>>
> > >>>
> > >>>
> > >>> Master+the patch:
> > >>>
> > >>> Combine them together:
> > >>>
> > >>> Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> > >>> timestamp in milliseconds. As we can see, the master latency is
> > >>> obviously much higher than the one with my patch once the master bcache
> > >>> hits the cutoff writeback sync, and the master isn’t going to get out of
> > >>> this cutoff writeback sync situation. This graph shows it had already
> > >>> been stuck at the cutoff writeback sync for about 4 hours before I
> > >>> finished the testing, and it may still need to stay stuck for days
> > >>> before it can get out of this situation by itself.
> > >>>
> > >>>
> > >>> Note that there are 1 million points for each run: red represents
> > >>> master, green represents master+my patch. Most of them overlap with
> > >>> each other, so it may look like this graph has more red points than
> > >>> green after hitting the cutoff, but that is simply because the latency
> > >>> has scaled to a bigger range which represents the HDD latency.
> > >>>
> > >>>
> > >>>
> > >>> Let’s also have a look at the bcache’s cache available percent and dirty
> > >>> data percent.
> > >>>
> > >>> Master:
> > >>>
> > >>> Master+this patch:
> > >>>
> > >>> As you can see, this patch can avoid it hitting the cutoff writeback sync.
> > >>>
> > >>>
> > >>> As for the improvement of this patch over the first one, let’s
> > >>> take a look at how the writeback rate changes during the run.
> > >>>
> > >>> patch V1:
> > >>>
> > >>>
> > >>>
> > >>> Patch V2:
> > >>>
> > >>>
> > >>> The Y-axis is the value of the rate. V1 is very aggressive, as it
> > >>> jumps instantly from the minimum 8 to around 10 million, while patch V2
> > >>> keeps the rate under 5000 during the run, and after the first round of
> > >>> writeback it can stay even under 2500. This proves we don’t need to
> > >>> be as aggressive as V1 to get out of the high-fragmentation situation
> > >>> which eventually causes all writes to hit the backing device. This
> > >>> looks very reasonable to me now.
> > >>>
> > >>> Note that the fio command that I used consumes buckets quite
> > >>> aggressively, so it had to hit the third stage, which has the highest
> > >>> aggressiveness. But I believe this is not true in a real production
> > >>> env; a real production env won’t consume buckets that aggressively, so
> > >>> I expect stage 3 will rarely need to be hit.
> > >>>
> > >>>
> > >>> As discussed, I'll run multiple block size testing on at least 1TB NVME
> > >>> device later.
> > >>> But it might take some time.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Dongdong
> > >>>
> > >>> On Tue, Jan 5, 2021 at 12:33 PM Coly Li <[email protected]> wrote:
> > >>>
> > >>> On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > >>> > Hey Coly,
> > >>> >
> > >>> > This is the second version of the patch, please allow me to explain a
> > >>> > bit for this patch:
> > >>> >
> > >>> > We accelerate the rate in 3 stages with different aggressiveness, the
> > >>> > first stage starts when dirty buckets percent reach above
> > >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW(50), the second is
> > >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID(57) and the third is
> > >>> > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH(64). By default the first stage
> > >>> > tries to writeback the amount of dirty data in one bucket (on average)
> > >>> > in (1 / (dirty_buckets_percent - 50)) second, the second stage
> > >>> tries to
> > >>> > writeback the amount of dirty data in one bucket in (1 /
> > >>> > (dirty_buckets_percent - 57)) * 200 millisecond. The third stage tries
> > >>> > to writeback the amount of dirty data in one bucket in (1 /
> > >>> > (dirty_buckets_percent - 64)) * 20 millisecond.
> > >>> >
> > >>> > As we can see, there are two writeback aggressiveness increasing
> > >>> > strategies, one strategy is with the increasing of the stage, the
> > >>> first
> > >>> > stage is the easy-going phase whose initial rate is trying to
> > >>> write back
> > >>> > dirty data of one bucket in 1 second, the second stage is a bit more
> > >>> > aggressive, the initial rate tries to writeback the dirty data of one
> > >>> > bucket in 200 ms, the last stage is even more, whose initial rate
> > >>> tries
> > >>> > to writeback the dirty data of one bucket in 20 ms. This makes sense,
> > >>> > one reason is that if the preceding stage couldn’t get the
> > >>> fragmentation
> > >>> > to a fine stage, then the next stage should increase the
> > >>> aggressiveness
> > >>> > properly, also it is because the later stage is closer to the
> > >>> > bch_cutoff_writeback_sync. Another aggressiveness increasing
> > >>> strategy is
> > >>> > with the increasing of dirty bucket percent within each stage, the
> > >>> first
> > >>> > strategy controls the initial writeback rate of each stage, while this
> > >>> > one increases the rate based on the initial rate, which is
> > >>> initial_rate
> > >>> > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> > >>> >
> > >>> > The initial rate can be controlled by 3 parameters
> > >>> > writeback_rate_fp_term_low, writeback_rate_fp_term_mid,
> > >>> > writeback_rate_fp_term_high, they are default 1, 5, 50, users can
> > >>> adjust
> > >>> > them based on their needs.
> > >>> >
> > >>> > The reason that I choose 50, 57, 64 as the threshold value is because
> > >>> > the GC must be triggered at least once during each stage due to the
> > >>> > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache
> > >>> size. So,
> > >>> > the hope is that the first and second stage can get us back to good
> > >>> > shape in most situations by smoothly writing back the dirty data
> > >>> without
> > >>> > giving too much stress to the backing devices, but it might still
> > >>> enter
> > >>> > the third stage if the bucket consumption is very aggressive.
> > >>> >
> > >>> > This patch use (dirty / dirty_buckets) * fp_term to calculate the
> > >>> rate,
> > >>> > this formula means that we want to writeback (dirty /
> > >>> dirty_buckets) in
> > >>> > 1/fp_term second, fp_term is calculated by above aggressiveness
> > >>> > controller, “dirty” is the current dirty sectors, “dirty_buckets”
> > >>> is the
> > >>> > current dirty buckets, so (dirty / dirty_buckets) means the average
> > >>> > dirty sectors in one bucket, the value is between 0 to 1024 for the
> > >>> > default setting, so this formula basically gives a hint that to
> > >>> reclaim
> > >>> > one bucket in 1/fp_term second. By using this semantic, we can have a
> > >>> > lower writeback rate when the amount of dirty data is decreasing and
> > >>> > overcome the fact that dirty buckets number is always increasing
> > >>> unless
> > >>> > GC happens.
> > >>> >
> > >>> > *Compare to the first patch:
> > >>> > *The first patch is trying to write back all the data in 40 seconds,
> > >>> > this will result in a very high writeback rate when the amount of
> > >>> dirty
> > >>> > data is big, this is mostly true for the large cache devices. The
> > >>> basic
> > >>> > problem is that the semantic of this patch is not ideal, because we
> > >>> > don’t really need to writeback all dirty data in order to solve this
> > >>> > issue, and the instant large increase of the rate is something I
> > >>> feel we
> > >>> > should better avoid (I like things to be smoothly changed unless no
> > >>> > choice: )).
> > >>> >
> > >>> > Before I get to this new patch(which I believe should be optimal
> > >>> for me
> > >>> > atm), there have been many tuning/testing iterations, eg. I’ve
> > >>> tried to
> > >>> > tune the algorithm to writeback ⅓ of the dirty data in a certain
> > >>> amount
> > >>> > of seconds, writeback 1/fragment of the dirty data in a certain amount
> > >>> > of seconds, writeback all the dirty data only in those error_buckets
> > >>> > (error buckets = dirty buckets - 50% of the total buckets) in a
> > >>> certain
> > >>> > amount of time. However, those all turn out not to be ideal, only the
> > >>> > semantic of the patch makes much sense for me and allows me to control
> > >>> > the rate in a more precise way.
> > >>> >
> > >>> > *Testing data:
> > >>> > *I'll provide the visualized testing data in the next couple of days
> > >>> > with 1TB NVME devices cache but with HDD as backing device since it's
> > >>> > what we mostly used in production env.
> > >>> > I have the data for 400GB NVME, let me prepare it and take it for
> > >>> you to
> > >>> > review.
> > >>> [snipped]
> > >>>
> > >>> Hi Dongdong,
> > >>>
> > >>> Thanks for the update and continuous effort on this idea.
> > >>>
> > >>> Please keep in mind the writeback rate is just a advice rate for the
> > >>> writeback throughput, in real workload changing the writeback rate
> > >>> number does not change writeback throughput obviously.
> > >>>
> > >>> Currently I feel this is an interesting and promising idea for your
> > >>> patch, but I am not able to say whether it may take effect in real
> > >>> workload, so we do need convinced performance data on real workload and
> > >>> configuration.
> > >>>
> > >>> Of course I may also help on the benchmark, but my to-do list is long
> > >>> enough and it may take a very long delay time.
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Coly Li
> > >>>
> > >>
> >
On 1/14/21 12:45 PM, Dongdong Tao wrote:
> Hi Coly,
>
> I've got the testing data for multiple threads with larger IO depth.
>
Hi Dongdong,
Thanks for the testing number.
> *Here is the testing steps:
> *1. make-bcache -B <> -C <> --writeback
>
> 2. Open two tabs, start different fio task in them at the same time.
> Tab1 run below fio command:
> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
>
> Tab2 run below fio command:
> sudo fio --name=random-writers2 --filename=/dev/bcache0
> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> --direct=1 --write_lat_log=rw --log_avg_msec=20
>
Why do you limit the iodepth to 8 and the iops to 150 on the cache device?
For a cache device this limitation is small. 150 IOPS with a 4KB block
size means writing (150*4*60*60 = 2160000KB, about) 2GB of data every
hour. For 35 hours it is only 70GB.
What if the iodepth is 128 or 64, and there is no iops rate limitation?
> Note
> - Tab1 fio will run for 24000 seconds, which is the one to cause the
> fragmentation and made the cache_available_percent drops to under 40.
> - Tab2 fio is the one that I'm capturing the latency and I have let it
> run for about 35 hours, which is long enough to allow the
> cache_available_percent drops under 30.
> - This testing method utilized fio benchmark with larger read block
> size/small write block size to cause the high fragmentation, However in
> a real production env, there could be
> various reasons or a combination of various reasons to cause the high
> fragmentation, but I believe it should be ok to use any method to cause
> the fragmentation to verify if
> bcache with this patch is responding better than the master in this
> situation.
>
> *Below is the testing result:*
>
> The total run time is about 35 hours, the latency points in the charts
> for each run are 1.5 million
>
> Master:
> fio-lat-mater.png
>
> Master + patch:
> fio-lat-patch.png
> Combine them together:
> fio-lat-mix.png
>
> Now we can see the master is even worse when we increase the iodepth,
> which makes sense since the backing HDD is being stressed more hardly.
>
> *Below are the cache stats changing during the run:*
> Master:
> bcache-stats-master.png
>
> Master + the patch:
> bcache-stats-patch.png
>
> That's all the testing done with 400GB NVME with 512B block size.
>
> Coly, do you want me to continue the same testing on 1TB nvme with
> different block size ?
> or is it ok to skip the 1TB testing and continue the test with 400GB
> NVME but with different block size?
> feel free to let me know any other test scenarios that we should cover
> here.
Yes please, more testing is desired for a performance improvement. So far
I don't see performance numbers for a real high workload yet.
Thanks.
Coly Li
Hi Coly,
Why do you limit the iodepth to 8 and the iops to 150 on the cache device?
For a cache device this limitation is small. 150 IOPS with a 4KB block
size means writing (150*4*60*60 = 2160000KB, about) 2GB of data every
hour. For 35 hours it is only 70GB.
What if the iodepth is 128 or 64, and there is no iops rate limitation?
-> There are two reasons why I limit the iodepth and iops rate.
1. If I don't limit them, the dirty cache will be filled up very
quickly, within 20 minutes.
   It's running at almost NVME speed before it reaches the 70 percent
cutoff_writeback_sync, and there is no way for any kind of writeback
to stop it from filling up, due to the huge gap between NVME and HDD
in terms of throughput.
   I don't think there is anything we can do about it, and it should
only happen in a benchmark world, not in production.
   The improvement I'm trying to make here is for a normal production
workload, not for this benchmark scenario.
   I currently can't see any necessity to test this scenario, please
kindly let me know if I'm wrong about this.
2. The reason that I set the iodepth to 8 and the iops to 150 is based
on what I have observed in production environments, mostly ceph:
   ceph-osd has fewer than 10 threads (default setting) that send io
to bcache in parallel. But I'm not sure about other applications.
   I agree that we can increase the iodepth to 64 or 128, that's
doable. But we have to limit the iops; 150 IOPS is a reasonable
workload.
   The busiest ceph-osd that I've seen is about 1000 IOPS, but the
average is still only about 600.
   I can set the IOPS to a higher value like 600 and the iodepth to
128 for the later test, if that makes sense to you?
Lastly, please allow me to clarify more about the production issue
that this patch is trying to address:
In the production envs that hit this issue, it usually takes a very
long time (often days) for the cache_available_percent to drop to
30, and the dirty data mostly stays at a very low level (around
10 percent), which means that the bcache isn't being stressed very
hard most of the time.
There is no intention to save the cutoff_writeback_sync when the
bcache is being stressed without limitation, hope the above makes sense :)
By the way, my colleague and I are trying to gather some production
bcache stats, and I hope we can give you the performance numbers before
and after applying the patch.
Thanks,
Dongdong
On Thu, Jan 14, 2021 at 6:05 PM Coly Li <[email protected]> wrote:
>
> On 1/14/21 12:45 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > I've got the testing data for multiple threads with larger IO depth.
> >
>
> Hi Dongdong,
>
> Thanks for the testing number.
>
> > *Here is the testing steps:
> > *1. make-bcache -B <> -C <> --writeback
> >
> > 2. Open two tabs, start different fio task in them at the same time.
> > Tab1 run below fio command:
> > sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> > --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
> >
> > Tab2 run below fio command:
> > sudo fio --name=random-writers2 --filename=/dev/bcache0
> > --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> > --direct=1 --write_lat_log=rw --log_avg_msec=20
> >
>
>
> Why you limit the iodep to 8 and iops to 150 on cache device?
> For cache device the limitation is small. Iosp 150 with 4KB block size,
> it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
> hours it is only 70GB.
>
>
> What if the iodeps is 128 or 64, and no iops rate limitation ?
>
>
> > Note
> > - Tab1 fio will run for 24000 seconds, which is the one to cause the
> > fragmentation and made the cache_available_percent drops to under 40.
> > - Tab2 fio is the one that I'm capturing the latency and I have let it
> > run for about 35 hours, which is long enough to allow the
> > cache_available_percent drops under 30.
> > - This testing method utilized fio benchmark with larger read block
> > size/small write block size to cause the high fragmentation, However in
> > a real production env, there could be
> > various reasons or a combination of various reasons to cause the high
> > fragmentation, but I believe it should be ok to use any method to cause
> > the fragmentation to verify if
> > bcache with this patch is responding better than the master in this
> > situation.
> >
> > *Below is the testing result:*
> >
> > The total run time is about 35 hours, the latency points in the charts
> > for each run are 1.5 million
> >
> > Master:
> > fio-lat-mater.png
> >
> > Master + patch:
> > fio-lat-patch.png
> > Combine them together:
> > fio-lat-mix.png
> >
> > Now we can see the master is even worse when we increase the iodepth,
> > which makes sense since the backing HDD is being stressed more hardly.
> >
> > *Below are the cache stats changing during the run:*
> > Master:
> > bcache-stats-master.png
> >
> > Master + the patch:
> > bcache-stats-patch.png
> >
> > That's all the testing done with 400GB NVME with 512B block size.
> >
> > Coly, do you want me to continue the same testing on 1TB nvme with
> > different block size ?
> > or is it ok to skip the 1TB testing and continue the test with 400GB
> > NVME but with different block size?
> > feel free to let me know any other test scenarios that we should cover
> > here.
>
> Yes please, more testing is desired for performance improvement. So far
> I don't see performance number for real high work load yet.
>
> Thanks.
>
> Coly Li
>
On 1/14/21 8:22 PM, Dongdong Tao wrote:
> Hi Coly,
>
> Why you limit the iodeph to 8 and iops to 150 on cache device?
> For cache device the limitation is small. Iosp 150 with 4KB block size,
> it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
> hours it is only 70GB.
>
>
> What if the iodepth is 128 or 64, and no iops rate limitation ?
> -> There are two reasons why I limit the iodepth and iops rate.
> 1. If I don't limit them, the dirty cache will be filled up very
> quickly within 20 minutes.
> It's almost NVME speed before it reaches the 70
> cutoff_writeback_sync, there is no way for any kind of writeback to
> stop it from
> filling up due to the huge gap between NVME and HDD in terms of
> the throughput,
> I don't think there is anything we can do about it? and it should
> only happen in a benchmark world, not should in production.
> The improvement I'm trying to do here is just for normal
> production workload ,not for this benchmark scenario really.
> I currently can't see any necessity to test this scenario, please
> kindly let me know about this if I'm wrong.
>
> 2. The reason that I set iodepth to 8 and iops to 150 is based on the
> experience that I observed from production env, mostly ceph,
> ceph-osd has less than 10 thread(default setting) that will send
> io to bcache in parallel. But I'm not sure about other applications.
> I agree that we can increase the iodepth to 64 or 128 and it's
> doable. But we have to limit the iops, 150 IOPS is a reasonable
> workload.
> The most busy ceph-osd that I've seen is about 1000 IOPS, but on
> average is still only about 600.
> I can set the IOPS to a higher value like 600 and the iodepth to
> 128 to perform the later test if it make sense to you?
>
OK, with the extra information I now understand the reason. Since the
cache device is filled up within 20 minutes, it is unnecessary to do the
faster testing on your side. Let me do it later on my hardware.
> Lastly, please allow me to clarify more about the production issue
> that this patch is trying to address:
>
> In the production env that hit this issue, it usually takes a very
> long time (many take days) for the cache_available_percent to drop to
> 30, and the dirty data is mostly staying at a very low level (around
> 10 percent), which means that the bcache isn't being stressed very
> hard most of the time.
> There is no intention to save the cutoff_writeback_sync when the
> bcache is being stressed without limitation, hope above make sense :)
>
Yes, you explained this clearly previously. What I worried about was
whether a faster writeback may interfere with the throughput and latency
of regular I/Os.
From your current testing data it looks fine to me.
> By the way, my colleague and I are trying to gathering some production
> bcache stats, I hope we can give you the performance number before and
> after applying the patch.
Yes that will be great.
And could you please gather all the current data charts into a single
email, and reference it in your patch via lore? Then people who don't
subscribe to the linux-bcache mailing list may find all the posted
performance data from your patch.
In general your testing data is convincing IMHO, and I will add your
updated patch for the 5.12 merge window.
Thanks.
Coly Li
>
>
> On Thu, Jan 14, 2021 at 6:05 PM Coly Li <[email protected]> wrote:
>>
>> On 1/14/21 12:45 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>> I've got the testing data for multiple threads with larger IO depth.
>>>
>>
>> Hi Dongdong,
>>
>> Thanks for the testing number.
>>
>>> *Here is the testing steps:
>>> *1. make-bcache -B <> -C <> --writeback
>>>
>>> 2. Open two tabs, start different fio task in them at the same time.
>>> Tab1 run below fio command:
>>> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
>>> --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
>>>
>>> Tab2 run below fio command:
>>> sudo fio --name=random-writers2 --filename=/dev/bcache0
>>> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
>>> --direct=1 --write_lat_log=rw --log_avg_msec=20
>>>
>>
>>
>> Why you limit the iodep to 8 and iops to 150 on cache device?
>> For cache device the limitation is small. Iosp 150 with 4KB block size,
>> it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
>> hours it is only 70GB.
>>
>>
>> What if the iodeps is 128 or 64, and no iops rate limitation ?
>>
>>
>>> Note
>>> - Tab1 fio will run for 24000 seconds, which is the one to cause the
>>> fragmentation and made the cache_available_percent drops to under 40.
>>> - Tab2 fio is the one that I'm capturing the latency and I have let it
>>> run for about 35 hours, which is long enough to allow the
>>> cache_available_percent drops under 30.
>>> - This testing method utilized fio benchmark with larger read block
>>> size/small write block size to cause the high fragmentation, However in
>>> a real production env, there could be
>>> various reasons or a combination of various reasons to cause the high
>>> fragmentation, but I believe it should be ok to use any method to cause
>>> the fragmentation to verify if
>>> bcache with this patch is responding better than the master in this
>>> situation.
>>>
>>> *Below is the testing result:*
>>>
>>> The total run time is about 35 hours, the latency points in the charts
>>> for each run are 1.5 million
>>>
>>> Master:
>>> fio-lat-mater.png
>>>
>>> Master + patch:
>>> fio-lat-patch.png
>>> Combine them together:
>>> fio-lat-mix.png
>>>
>>> Now we can see the master is even worse when we increase the iodepth,
>>> which makes sense since the backing HDD is being stressed more hardly.
>>>
>>> *Below are the cache stats changing during the run:*
>>> Master:
>>> bcache-stats-master.png
>>>
>>> Master + the patch:
>>> bcache-stats-patch.png
>>>
>>> That's all the testing done with 400GB NVME with 512B block size.
>>>
>>> Coly, do you want me to continue the same testing on 1TB nvme with
>>> different block size ?
>>> or is it ok to skip the 1TB testing and continue the test with 400GB
>>> NVME but with different block size?
>>> feel free to let me know any other test scenarios that we should cover
>>> here.
>>
>> Yes please, more testing is desired for performance improvement. So far
>> I don't see performance number for real high work load yet.
>>
>> Thanks.
>>
>> Coly Li
>>
Hi Coly,
Apologies for any confusion that I might have caused, and thanks a lot
for your patience and your help !
On Thu, Jan 14, 2021 at 9:31 PM Coly Li <[email protected]> wrote:
>
> On 1/14/21 8:22 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > Why you limit the iodeph to 8 and iops to 150 on cache device?
> > For cache device the limitation is small. Iosp 150 with 4KB block size,
> > it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
> > hours it is only 70GB.
> >
> >
> > What if the iodepth is 128 or 64, and no iops rate limitation ?
> > -> There are two reasons why I limit the iodepth and iops rate.
> > 1. If I don't limit them, the dirty cache will be filled up very
> > quickly within 20 minutes.
> > It's almost NVME speed before it reaches the 70
> > cutoff_writeback_sync, there is no way for any kind of writeback to
> > stop it from
> > filling up due to the huge gap between NVME and HDD in terms of
> > the throughput,
> > I don't think there is anything we can do about it? and it should
> > only happen in a benchmark world, not should in production.
> > The improvement I'm trying to do here is just for normal
> > production workload ,not for this benchmark scenario really.
> > I currently can't see any necessity to test this scenario, please
> > kindly let me know about this if I'm wrong.
> >
> > 2. The reason that I set iodepth to 8 and iops to 150 is based on the
> > experience that I observed from production env, mostly ceph,
> > ceph-osd has less than 10 thread(default setting) that will send
> > io to bcache in parallel. But I'm not sure about other applications.
> > I agree that we can increase the iodepth to 64 or 128 and it's
> > doable. But we have to limit the iops, 150 IOPS is a reasonable
> > workload.
> > The most busy ceph-osd that I've seen is about 1000 IOPS, but on
> > average is still only about 600.
> > I can set the IOPS to a higher value like 600 and the iodepth to
> > 128 to perform the later test if it make sense to you?
> >
>
> OK, now I know the reason with the extra information. Since the cache
> device is filled up within 20 minutes, it is unnecessary to do the
> faster testing on your side. Let me do it later on my hardware.
>
>
> > Lastly, please allow me to clarify more about the production issue
> > that this patch is trying to address:
> >
> > In the production env that hit this issue, it usually takes a very
> > long time (many take days) for the cache_available_percent to drop to
> > 30, and the dirty data is mostly staying at a very low level (around
> > 10 percent), which means that the bcache isn't being stressed very
> > hard most of the time.
> > There is no intention to save the cutoff_writeback_sync when the
> > bcache is being stressed without limitation, hope above make sense :)
> >
>
> Yes you explained clearly previously. What I worried was whether a
> faster writeback may interfere throughput and latency of regular I/O
> regular I/Os.
>
> From your current testing data it looks find with me.
>
>
> > By the way, my colleague and I are trying to gathering some production
> > bcache stats, I hope we can give you the performance number before and
> > after applying the patch.
>
> Yes that will be great.
>
> And could you please gather all current data chats into a single email,
> and reference it in your patch via lore ? Then for people don't
> subscribe linux-bcache mailing list, they may find all the posted
> performance data from you patch.
>
Sounds good, I'll update the patch comment with the reference data.
But it seems like the linux mailing list doesn't accept charts?
(they always get detected as SPAM)
I can't be sure though; I'll try to send them again, and if that fails
I'll put all the data into a google doc.
> In general your testing data is convinced IMHO, and I will add your
> updated patch for 5.12 merge window.
>
Thank you Coly, that's great !!!
>
> Thanks.
>
> Coly Li
>
>
> >
> >
> > On Thu, Jan 14, 2021 at 6:05 PM Coly Li <[email protected]> wrote:
> >>
> >> On 1/14/21 12:45 PM, Dongdong Tao wrote:
> >>> Hi Coly,
> >>>
> >>> I've got the testing data for multiple threads with larger IO depth.
> >>>
> >>
> >> Hi Dongdong,
> >>
> >> Thanks for the testing number.
> >>
> >>> *Here is the testing steps:
> >>> *1. make-bcache -B <> -C <> --writeback
> >>>
> >>> 2. Open two tabs, start different fio task in them at the same time.
> >>> Tab1 run below fio command:
> >>> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> >>> --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
> >>>
> >>> Tab2 run below fio command:
> >>> sudo fio --name=random-writers2 --filename=/dev/bcache0
> >>> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> >>> --direct=1 --write_lat_log=rw --log_avg_msec=20
> >>>
> >>
> >>
> >> Why you limit the iodep to 8 and iops to 150 on cache device?
> >> For cache device the limitation is small. Iosp 150 with 4KB block size,
> >> it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
> >> hours it is only 70GB.
> >>
> >>
> >> What if the iodeps is 128 or 64, and no iops rate limitation ?
> >>
> >>
> >>> Note
> >>> - Tab1 fio will run for 24000 seconds, which is the one to cause the
> >>> fragmentation and made the cache_available_percent drops to under 40.
> >>> - Tab2 fio is the one that I'm capturing the latency and I have let it
> >>> run for about 35 hours, which is long enough to allow the
> >>> cache_available_percent drops under 30.
> >>> - This testing method utilized fio benchmark with larger read block
> >>> size/small write block size to cause the high fragmentation, However in
> >>> a real production env, there could be
> >>> various reasons or a combination of various reasons to cause the high
> >>> fragmentation, but I believe it should be ok to use any method to cause
> >>> the fragmentation to verify if
> >>> bcache with this patch is responding better than the master in this
> >>> situation.
> >>>
> >>> *Below is the testing result:*
> >>>
> >>> The total run time is about 35 hours, the latency points in the charts
> >>> for each run are 1.5 million
> >>>
> >>> Master:
> >>> fio-lat-mater.png
> >>>
> >>> Master + patch:
> >>> fio-lat-patch.png
> >>> Combine them together:
> >>> fio-lat-mix.png
> >>>
> >>> Now we can see the master is even worse when we increase the iodepth,
> >>> which makes sense since the backing HDD is being stressed more hardly.
> >>>
> >>> *Below are the cache stats changing during the run:*
> >>> Master:
> >>> bcache-stats-master.png
> >>>
> >>> Master + the patch:
> >>> bcache-stats-patch.png
> >>>
> >>> That's all the testing done with 400GB NVME with 512B block size.
> >>>
> >>> Coly, do you want me to continue the same testing on 1TB nvme with
> >>> different block size ?
> >>> or is it ok to skip the 1TB testing and continue the test with 400GB
> >>> NVME but with different block size?
> >>> feel free to let me know any other test scenarios that we should cover
> >>> here.
> >>
> >> Yes please, more testing is desired for performance improvement. So far
> >> I don't see performance number for real high work load yet.
> >>
> >> Thanks.
> >>
> >> Coly Li
> >>
>
Hi Dongdong,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62
config: i386-randconfig-a002-20200806 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/7777fef68d1401235db42dd0d59c5c3dba3d42d3
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Dongdong-Tao/bcache-consider-the-fragmentation-when-update-the-writeback-rate/20210105-110903
git checkout 7777fef68d1401235db42dd0d59c5c3dba3d42d3
# save the attached .config to linux build tree
make W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
ld: drivers/md/bcache/writeback.o: in function `__update_writeback_rate':
>> drivers/md/bcache/writeback.c:106: undefined reference to `__divdi3'
>> ld: drivers/md/bcache/writeback.c:120: undefined reference to `__divdi3'
vim +106 drivers/md/bcache/writeback.c
60
61 static void __update_writeback_rate(struct cached_dev *dc)
62 {
63 /*
64 * PI controller:
65 * Figures out the amount that should be written per second.
66 *
67 * First, the error (number of sectors that are dirty beyond our
68 * target) is calculated. The error is accumulated (numerically
69 * integrated).
70 *
71 * Then, the proportional value and integral value are scaled
72 * based on configured values. These are stored as inverses to
73 * avoid fixed point math and to make configuration easy-- e.g.
74 * the default value of 40 for writeback_rate_p_term_inverse
75 * attempts to write at a rate that would retire all the dirty
76 * blocks in 40 seconds.
77 *
78 * The writeback_rate_i_inverse value of 10000 means that 1/10000th
79 * of the error is accumulated in the integral term per second.
80 * This acts as a slow, long-term average that is not subject to
81 * variations in usage like the p term.
82 */
83 int64_t target = __calc_target_rate(dc);
84 int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
85 int64_t error = dirty - target;
86 int64_t proportional_scaled =
87 div_s64(error, dc->writeback_rate_p_term_inverse);
88 int64_t integral_scaled;
89 uint32_t new_rate;
90
91 /*
92 * We need to consider the number of dirty buckets as well
93 * when calculating the proportional_scaled, Otherwise we might
94 * have an unreasonable small writeback rate at a highly fragmented situation
95 * when very few dirty sectors consumed a lot dirty buckets, the
96 * worst case is when dirty_data reached writeback_percent and
97 * dirty buckets reached to cutoff_writeback_sync, but the rate
98 * still will be at the minimum value, which will cause the write
99 * stuck at a non-writeback mode.
100 */
101 struct cache_set *c = dc->disk.c;
102
103 int64_t dirty_buckets = c->nbuckets - c->avail_nbuckets;
104
105 if (c->gc_stats.in_use > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW && dirty > 0) {
> 106 int64_t fragment = (dirty_buckets * c->cache->sb.bucket_size) / dirty;
107 int64_t fp_term;
108 int64_t fps;
109
110 if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID) {
111 fp_term = dc->writeback_rate_fp_term_low *
112 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
113 } else if (c->gc_stats.in_use <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH) {
114 fp_term = dc->writeback_rate_fp_term_mid *
115 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
116 } else {
117 fp_term = dc->writeback_rate_fp_term_high *
118 (c->gc_stats.in_use - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
119 }
> 120 fps = (dirty / dirty_buckets) * fp_term;
121 if (fragment > 3 && fps > proportional_scaled) {
122 //Only overwrite the p when fragment > 3
123 proportional_scaled = fps;
124 }
125 }
126
127 if ((error < 0 && dc->writeback_rate_integral > 0) ||
128 (error > 0 && time_before64(local_clock(),
129 dc->writeback_rate.next + NSEC_PER_MSEC))) {
130 /*
131 * Only decrease the integral term if it's more than
132 * zero. Only increase the integral term if the device
133 * is keeping up. (Don't wind up the integral
134 * ineffectively in either case).
135 *
136 * It's necessary to scale this by
137 * writeback_rate_update_seconds to keep the integral
138 * term dimensioned properly.
139 */
140 dc->writeback_rate_integral += error *
141 dc->writeback_rate_update_seconds;
142 }
143
144 integral_scaled = div_s64(dc->writeback_rate_integral,
145 dc->writeback_rate_i_term_inverse);
146
147 new_rate = clamp_t(int32_t, (proportional_scaled + integral_scaled),
148 dc->writeback_rate_minimum, NSEC_PER_SEC);
149
150 dc->writeback_rate_proportional = proportional_scaled;
151 dc->writeback_rate_integral_scaled = integral_scaled;
152 dc->writeback_rate_change = new_rate -
153 atomic_long_read(&dc->writeback_rate.rate);
154 atomic_long_set(&dc->writeback_rate.rate, new_rate);
155 dc->writeback_rate_target = target;
156 }
157
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
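The two __divdi3 references above come from the plain '/' on int64_t operands
at lines 106 and 120, which a 32-bit i386 build cannot lower without libgcc.
A minimal sketch of how those two divisions could be written with the 64-bit
division helpers from <linux/math64.h> (this only illustrates the helpers, it
is not necessarily the fix that was eventually applied):
---
#include <linux/math64.h>

	/* line 106: s64 / s64, use div64_s64() instead of the '/' operator */
	int64_t fragment = div64_s64(dirty_buckets * c->cache->sb.bucket_size,
				     dirty);
	...
	/* line 120: likewise, avoid the open-coded 64-bit division */
	fps = div64_s64(dirty, dirty_buckets) * fp_term;
---
div_s64() (already used for the p and i terms in the listing) would also work
for line 120 if dirty_buckets is known to fit in 32 bits.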