From: Coly Li
To: Dongdong Tao
Cc: Kent Overstreet, "open list:BCACHE (BLOCK LAYER CACHE)", open list, Gavin Guo, Gerald Yang, Trent Lloyd, Dominique Poulain, Dongsheng Yang, Benjamin Allot
Subject: Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
Date: Thu, 14 Jan 2021 21:31:25 +0800
Message-ID: <3ca15755-9ad2-1d57-b86a-fb659f701cfb@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/14/21 8:22 PM, Dongdong Tao wrote:
> Hi Coly,
>
> Why do you limit
> the iodepth to 8 and IOPS to 150 on the cache device?
> For a cache device this limit is small. At 150 IOPS with a 4KB block size,
> each hour writes only about 2GB of data (150*4*60*60 = 2160000KB). Over 35
> hours that is only about 70GB.
>
> What if the iodepth is 128 or 64, with no IOPS rate limit?
>
> -> There are two reasons why I limit the iodepth and the IOPS rate.
> 1. If I don't limit them, the dirty cache fills up very quickly, within
>    20 minutes. Writes run at almost NVMe speed until it reaches the
>    cutoff_writeback_sync threshold of 70. No kind of writeback can stop
>    the cache from filling up, because of the huge throughput gap between
>    NVMe and HDD. I don't think there is anything we can do about that,
>    and it should only happen in a benchmark, not in production.
>    The improvement I'm trying to make here is for normal production
>    workloads, not for this benchmark scenario.
>    I currently can't see any need to test this scenario; please let me
>    know if I'm wrong.
>
> 2. I set the iodepth to 8 and the IOPS to 150 based on what I have
>    observed in production environments, mostly Ceph: ceph-osd has fewer
>    than 10 threads (default setting) that send I/O to bcache in parallel.
>    I'm not sure about other applications.
>    I agree that we can increase the iodepth to 64 or 128; that is doable.
>    But we have to limit the IOPS, and 150 IOPS is a reasonable workload.
>    The busiest ceph-osd that I've seen does about 1000 IOPS, but the
>    average is still only about 600.
>    I can set the IOPS to a higher value such as 600 and the iodepth to
>    128 for the later test, if that makes sense to you?
>

OK, with the extra information I now understand the reason. Since the cache
device fills up within 20 minutes, it is unnecessary to do the faster testing
on your side. Let me do it later on my hardware.
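As a quick check of the data-volume estimate quoted above, here is a minimal Python sketch. The function name `kb_written` is hypothetical, and 1GB is assumed to mean 10^6 KB. At 150 IOPS with 4KB blocks it gives 2.16GB per hour and 75.6GB over 35 hours; the 2GB and 70GB figures in the mail come from rounding the hourly value down before multiplying.

```python
def kb_written(iops: int, block_kb: int, hours: int) -> int:
    """KB written at a fixed IOPS rate and block size over the given hours."""
    return iops * block_kb * 60 * 60 * hours


KB_PER_GB = 1_000_000  # assumption: decimal gigabytes

# The rate-limited fio job from the thread: 150 IOPS, 4KB blocks.
per_hour_gb = kb_written(150, 4, 1) / KB_PER_GB
total_35h_gb = kb_written(150, 4, 35) / KB_PER_GB

# The higher rate proposed for a later test: 600 IOPS, 4KB blocks.
per_hour_600_gb = kb_written(600, 4, 1) / KB_PER_GB

print(f"150 IOPS: {per_hour_gb:.2f} GB/hour, {total_35h_gb:.1f} GB in 35 hours")
print(f"600 IOPS: {per_hour_600_gb:.2f} GB/hour")
```

The same function can be reused for any other rate or block size under discussion, which makes it easy to see how little data a rate-limited job writes compared with the capacity of the cache device.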
> Lastly, please allow me to clarify more about the production issue
> that this patch is trying to address:
>
> In the production environments that hit this issue, it usually takes a
> very long time (often days) for cache_available_percent to drop to 30,
> and the dirty data mostly stays at a very low level (around 10 percent),
> which means that bcache isn't being stressed very hard most of the time.
> There is no intention to avoid reaching cutoff_writeback_sync when
> bcache is being stressed without any limit. I hope the above makes
> sense :)
>

Yes, you explained that clearly before. What I worried about was whether a
faster writeback might interfere with the throughput and latency of regular
I/O. From your current testing data it looks fine to me.

> By the way, my colleague and I are trying to gather some production
> bcache stats; I hope we can give you the performance numbers from before
> and after applying the patch.

Yes, that will be great. And could you please gather all the current data
charts into a single email, and reference it in your patch via lore? Then
people who don't subscribe to the linux-bcache mailing list can find all the
posted performance data from your patch.

In general your testing data is convincing IMHO, and I will add your updated
patch for the 5.12 merge window.

Thanks.

Coly Li

>
>
> On Thu, Jan 14, 2021 at 6:05 PM Coly Li wrote:
>>
>> On 1/14/21 12:45 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>> I've got the testing data for multiple threads with larger IO depth.
>>>
>>
>> Hi Dongdong,
>>
>> Thanks for the testing numbers.
>>
>>> *Here are the testing steps:*
>>> 1. make-bcache -B <> -C <> --writeback
>>>
>>> 2. Open two tabs and start a different fio task in each at the same time.
>>> Tab1 runs the fio command below:
>>>   sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
>>>   --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
>>>
>>> Tab2 runs the fio command below:
>>>   sudo fio --name=random-writers2 --filename=/dev/bcache0
>>>   --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
>>>   --direct=1 --write_lat_log=rw --log_avg_msec=20
>>>
>>
>> Why do you limit the iodepth to 8 and IOPS to 150 on the cache device?
>> For a cache device this limit is small. At 150 IOPS with a 4KB block size,
>> each hour writes only about 2GB of data (150*4*60*60 = 2160000KB). Over 35
>> hours that is only about 70GB.
>>
>> What if the iodepth is 128 or 64, with no IOPS rate limit?
>>
>>
>>> Note
>>> - The Tab1 fio job runs for 24000 seconds. It is the one that causes the
>>>   fragmentation and makes cache_available_percent drop below 40.
>>> - The Tab2 fio job is the one whose latency I'm capturing. I let it run
>>>   for about 35 hours, which is long enough for cache_available_percent
>>>   to drop below 30.
>>> - This testing method uses the fio benchmark with a larger read block
>>>   size and a small write block size to cause high fragmentation. In a
>>>   real production environment there could be various reasons, or a
>>>   combination of reasons, for high fragmentation, but I believe any
>>>   method of causing the fragmentation is fine for verifying whether
>>>   bcache with this patch responds better than master in this situation.
>>>
>>> *Below is the testing result:*
>>>
>>> The total run time is about 35 hours, and the charts contain 1.5 million
>>> latency points for each run.
>>>
>>> Master:
>>> fio-lat-mater.png
>>>
>>> Master + patch:
>>> fio-lat-patch.png
>>>
>>> Both combined:
>>> fio-lat-mix.png
>>>
>>> Now we can see that master is even worse when we increase the iodepth,
>>> which makes sense since the backing HDD is being stressed harder.
>>>
>>> *Below are the cache stats changing during the run:*
>>> Master:
>>> bcache-stats-master.png
>>>
>>> Master + the patch:
>>> bcache-stats-patch.png
>>>
>>> That's all the testing done with the 400GB NVMe with 512B block size.
>>>
>>> Coly, do you want me to continue the same testing on a 1TB NVMe with a
>>> different block size?
>>> Or is it OK to skip the 1TB testing and continue the test with the 400GB
>>> NVMe but with a different block size?
>>> Feel free to let me know any other test scenarios that we should cover
>>> here.
>>
>> Yes please, more testing is desired for a performance improvement. So far
>> I don't see performance numbers for a really high workload yet.
>>
>> Thanks.
>>
>> Coly Li
>>