From: Dongdong Tao
Date: Thu, 14 Jan 2021 20:22:30 +0800
Subject: Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
To: Coly Li
Cc: Kent Overstreet, "open list:BCACHE (BLOCK LAYER CACHE)", open list,
    Gavin Guo, Gerald Yang, Trent Lloyd, Dominique Poulain,
    Dongsheng Yang, Benjamin Allot

Hi Coly,

Why do you limit the iodepth to 8 and the IOPS to 150 on the cache device?
For the cache device the limitation is small. 150 IOPS with a 4KB block size
means writing (150 * 4 * 60 * 60 = 2,160,000 KB, roughly) 2GB of data per
hour; for 35 hours that is only about 70GB. What if the iodepth is 128 or 64,
and there is no IOPS rate limitation?

-> There are two reasons why I limit the iodepth and the IOPS rate:

1. If I don't limit them, the dirty cache fills up very quickly, within 20
minutes. Writes land at almost NVMe speed until the cache reaches the 70%
cutoff_writeback_sync, and no kind of writeback can stop it from filling up,
because of the huge throughput gap between the NVMe cache and the HDD
backing device. I don't think there is anything we can do about that, and it
should only happen in a benchmark, not in production. The improvement I'm
trying to make here is for a normal production workload, not for this
benchmark scenario. I currently can't see any need to test that scenario;
please kindly let me know if I'm wrong about this.

2. The reason I set the iodepth to 8 and the IOPS to 150 is based on what
I have observed in production environments, mostly Ceph: a ceph-osd has
fewer than 10 threads (default setting) that send IO to bcache in parallel.
I'm not sure about other applications, though. I agree that we can increase
the iodepth to 64 or 128, and that's doable, but we have to limit the IOPS;
150 IOPS is a reasonable workload. The busiest ceph-osd I've seen does about
1000 IOPS, but the average is still only about 600. I can set the IOPS to a
higher value like 600 and the iodepth to 128 for the later tests, if that
makes sense to you?

Lastly, please allow me to clarify the production issue this patch is trying
to address: in the production environments that hit it, it usually takes a
very long time (often days) for cache_available_percent to drop to 30, while
the dirty data mostly stays at a very low level (around 10 percent), which
means bcache isn't being stressed very hard most of the time. There is no
intention to rescue the cutoff_writeback_sync case where bcache is being
stressed without any limitation. I hope the above makes sense :)

By the way, my colleague and I are trying to gather some production bcache
stats; I hope we can give you performance numbers from before and after
applying the patch.
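To make the scenario described above a little more concrete, here is a
minimal, self-contained user-space C sketch of the general idea only. It is
not the actual patch: every structure member, threshold, and formula in it is
a made-up placeholder. It just illustrates how a fragmentation measure (a
small amount of dirty data pinning many mostly-clean buckets) could be used
to boost a writeback rate that would otherwise stay low because the
dirty-data percentage alone is low.

/*
 * Illustrative sketch only -- NOT the actual patch. Models the idea
 * discussed in this thread: when available cache space is low because
 * dirty data is spread thinly across many buckets, boost the otherwise
 * dirty-data-driven writeback rate. All names/thresholds are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

struct cache_state {
	uint64_t bucket_size_sectors;   /* sectors per bucket */
	uint64_t dirty_buckets;         /* buckets holding any dirty data */
	uint64_t dirty_sectors;         /* total dirty sectors */
	uint64_t available_percent;     /* like cache_available_percent */
};

/*
 * Hypothetical fragmentation measure: bucket space pinned by dirty data
 * versus the dirty data itself. ~1 means dirty buckets are nearly full of
 * dirty data; a large value means little dirty data pins many buckets.
 */
static double fragmentation(const struct cache_state *c)
{
	uint64_t pinned = c->dirty_buckets * c->bucket_size_sectors;

	if (c->dirty_sectors == 0)
		return 1.0;
	return (double)pinned / (double)c->dirty_sectors;
}

/*
 * Boost the PD-controller rate only when available space is low AND
 * fragmentation is high; the 40% / 2.0 cutoffs are invented for the example.
 */
static uint64_t adjust_writeback_rate(const struct cache_state *c,
				      uint64_t pd_rate_sectors)
{
	double frag = fragmentation(c);

	if (c->available_percent < 40 && frag > 2.0)
		return (uint64_t)(pd_rate_sectors * frag);
	return pd_rate_sectors;
}

int main(void)
{
	/* Dirty data is low, but it is spread over 10x as much bucket space. */
	struct cache_state c = {
		.bucket_size_sectors = 1024,
		.dirty_buckets = 500000,
		.dirty_sectors = 51200000,
		.available_percent = 30,
	};

	printf("fragmentation: %.2f\n", fragmentation(&c));
	printf("adjusted rate: %llu sectors/s\n",
	       (unsigned long long)adjust_writeback_rate(&c, 4096));
	return 0;
}

In the real kernel the decision would of course live inside the existing
writeback rate update path; the standalone-program form here is only so the
numbers can be tried out directly.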
Thanks,
Dongdong

On Thu, Jan 14, 2021 at 6:05 PM Coly Li wrote:
>
> On 1/14/21 12:45 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > I've got the testing data for multiple threads with larger IO depth.
> >
>
> Hi Dongdong,
>
> Thanks for the testing number.
>
> > *Here is the testing steps:*
> > 1. make-bcache -B <> -C <> --writeback
> >
> > 2. Open two tabs, start different fio task in them at the same time.
> > Tab1 run below fio command:
> > sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> > --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
> >
> > Tab2 run below fio command:
> > sudo fio --name=random-writers2 --filename=/dev/bcache0
> > --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> > --direct=1 --write_lat_log=rw --log_avg_msec=20
> >
>
> Why you limit the iodep to 8 and iops to 150 on cache device?
> For cache device the limitation is small. Iosp 150 with 4KB block size,
> it means every hour writing (150*4*60*60=2160000KB=) 2GB data. For 35
> hours it is only 70GB.
>
> What if the iodeps is 128 or 64, and no iops rate limitation ?
>
> > Note
> > - Tab1 fio will run for 24000 seconds, which is the one to cause the
> > fragmentation and made the cache_available_percent drops to under 40.
> > - Tab2 fio is the one that I'm capturing the latency and I have let it
> > run for about 35 hours, which is long enough to allow the
> > cache_available_percent drops under 30.
> > - This testing method utilized fio benchmark with larger read block
> > size/small write block size to cause the high fragmentation, However in
> > a real production env, there could be various reasons or a combination
> > of various reasons to cause the high fragmentation, but I believe it
> > should be ok to use any method to cause the fragmentation to verify if
> > bcache with this patch is responding better than the master in this
> > situation.
> >
> > *Below is the testing result:*
> >
> > The total run time is about 35 hours, the latency points in the charts
> > for each run are 1.5 million
> >
> > Master:
> > fio-lat-mater.png
> >
> > Master + patch:
> > fio-lat-patch.png
> > Combine them together:
> > fio-lat-mix.png
> >
> > Now we can see the master is even worse when we increase the iodepth,
> > which makes sense since the backing HDD is being stressed more hardly.
> >
> > *Below are the cache stats changing during the run:*
> > Master:
> > bcache-stats-master.png
> >
> > Master + the patch:
> > bcache-stats-patch.png
> >
> > That's all the testing done with 400GB NVME with 512B block size.
> >
> > Coly, do you want me to continue the same testing on 1TB nvme with
> > different block size ?
> > or is it ok to skip the 1TB testing and continue the test with 400GB
> > NVME but with different block size?
> > feel free to let me know any other test scenarios that we should cover
> > here.
>
> Yes please, more testing is desired for performance improvement. So far
> I don't see performance number for real high work load yet.
>
> Thanks.
>
> Coly Li
>