From: Dongdong Tao
Date: Thu, 14 Jan 2021 23:35:16 +0800
Subject: Re: [PATCH] bcache: consider the fragmentation when update the writeback rate
To: Coly Li
Cc: Kent Overstreet, "open list:BCACHE (BLOCK LAYER CACHE)", open list,
    Gavin Guo, Gerald Yang, Trent Lloyd, Dominique Poulain,
    Dongsheng Yang, Benjamin Allot
In-Reply-To: <3ca15755-9ad2-1d57-b86a-fb659f701cfb@suse.de>
References: <20210105030602.14427-1-tdd21151186@gmail.com>
    <1a4b2a68-a7b0-8eb0-e60b-c3cf5a5a9e56@suse.de>
    <084276ab-7c74-31be-b957-3b039d7061a1@suse.de>
    <299ea3ff-4a9c-734e-0ec1-8b8d7480a019@suse.de>
    <392abd73-c58a-0a34-bd21-1e9adfffc870@suse.de>
    <3ca15755-9ad2-1d57-b86a-fb659f701cfb@suse.de>
List-ID: linux-kernel@vger.kernel.org

Hi Coly,

Apologies for any confusion that I might have caused, and thanks a lot
for your patience and your help!

On Thu, Jan 14, 2021 at 9:31 PM Coly Li wrote:
>
> On 1/14/21 8:22 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> > Why do you limit the iodepth to 8 and the IOPS to 150 on the cache device?
> > For the cache device the limitation is small. 150 IOPS with a 4KB block
> > size means writing (150*4*60*60 = 2160000 KB, roughly) 2GB per hour. For
> > 35 hours it is only 70GB.
> >
> > What if the iodepth is 128 or 64, and there is no IOPS rate limitation?
> >
> > -> There are two reasons why I limit the iodepth and the IOPS rate:
> >
> > 1. If I don't limit them, the dirty cache will be filled up very
> >    quickly, within 20 minutes.
> >    It fills at almost NVMe speed before it reaches the 70 percent
> >    cutoff_writeback_sync threshold; there is no way for any kind of
> >    writeback to stop it from filling up, due to the huge gap between
> >    NVMe and HDD throughput. I don't think there is anything we can do
> >    about it, and it should only happen in a benchmark world, not in
> >    production.
> >    The improvement I'm trying to make here is for a normal production
> >    workload, not really for this benchmark scenario.
> >    I currently can't see any necessity to test this scenario; please
> >    kindly let me know if I'm wrong about this.
> >
> > 2. The reason I set the iodepth to 8 and the IOPS to 150 is based on
> >    what I have observed in production environments, mostly Ceph: a
> >    ceph-osd has fewer than 10 threads (default setting) that send I/O
> >    to bcache in parallel. But I'm not sure about other applications.
> >    I agree that we can increase the iodepth to 64 or 128, and it's
> >    doable. But we have to limit the IOPS; 150 IOPS is a reasonable
> >    workload.
> >    The busiest ceph-osd that I've seen is about 1000 IOPS, but the
> >    average is still only about 600.
> >    I can set the IOPS to a higher value like 600 and the iodepth to
> >    128 for the later tests if that makes sense to you?
>
> OK, now I know the reason with the extra information. Since the cache
> device is filled up within 20 minutes, it is unnecessary to do the
> faster testing on your side. Let me do it later on my hardware.
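As a side note, the counters discussed above (cache_available_percent,
dirty data, writeback rate) can be sampled from sysfs while the workload
runs. The loop below is only a minimal sketch: it assumes a single cache
set under /sys/fs/bcache and a backing device exposed as bcache0, and the
exact sysfs layout may differ between kernel versions.

  # Sample the bcache counters once a minute during the soak test.
  # Assumes one cache set and one backing device (bcache0).
  cset=$(ls -d /sys/fs/bcache/*-*-*/ | head -n 1)
  bdev=/sys/block/bcache0/bcache
  while true; do
      echo "$(date +%F_%T)" \
           "available=$(cat "$cset"cache_available_percent)%" \
           "dirty=$(cat "$bdev"/dirty_data)" \
           "writeback_rate=$(cat "$bdev"/writeback_rate)"
      sleep 60
  done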
> > Lastly, please allow me to clarify more about the production issue
> > that this patch is trying to address:
> >
> > In the production environments that hit this issue, it usually takes a
> > very long time (often days) for cache_available_percent to drop to 30,
> > and the dirty data mostly stays at a very low level (around 10
> > percent), which means that the bcache isn't being stressed very hard
> > most of the time.
> > There is no intention to save the cutoff_writeback_sync when the
> > bcache is being stressed without limitation; hope the above makes sense :)
>
> Yes, you explained this clearly previously. What I worried about was
> whether a faster writeback may interfere with the throughput and
> latency of regular I/Os.
>
> From your current testing data it looks fine to me.
>
> > By the way, my colleague and I are trying to gather some production
> > bcache stats; I hope we can give you the performance numbers before
> > and after applying the patch.
>
> Yes, that will be great.
>
> And could you please gather all the current data charts into a single
> email, and reference it in your patch via lore? Then people who don't
> subscribe to the linux-bcache mailing list may find all the posted
> performance data from your patch.
>

Sounds good, I'll update the patch comment with reference data. But it
seems like the linux mailing list doesn't accept charts? (They have
always been detected as SPAM.) But I can't be sure; I'll try to send
them again, and if that doesn't work, I'll put all the data into a
Google doc.

> In general your testing data is convincing IMHO, and I will add your
> updated patch for the 5.12 merge window.
>

Thank you Coly, that's great!!!

> Thanks.
>
> Coly Li
>
> >
> > On Thu, Jan 14, 2021 at 6:05 PM Coly Li wrote:
> >>
> >> On 1/14/21 12:45 PM, Dongdong Tao wrote:
> >>> Hi Coly,
> >>>
> >>> I've got the testing data for multiple threads with a larger IO depth.
> >>
> >> Hi Dongdong,
> >>
> >> Thanks for the testing numbers.
> >>
> >>> Here are the testing steps:
> >>>
> >>> 1. make-bcache -B <> -C <> --writeback
> >>>
> >>> 2. Open two tabs and start a different fio task in each at the same time.
> >>>
> >>> Tab1 runs the fio command below:
> >>> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
> >>> --iodepth=32 --rw=randrw --blocksize=64k,8k --direct=1 --runtime=24000
> >>>
> >>> Tab2 runs the fio command below:
> >>> sudo fio --name=random-writers2 --filename=/dev/bcache0
> >>> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
> >>> --direct=1 --write_lat_log=rw --log_avg_msec=20
> >>
> >> Why do you limit the iodepth to 8 and the IOPS to 150 on the cache device?
> >> For the cache device the limitation is small. 150 IOPS with a 4KB block
> >> size means writing (150*4*60*60 = 2160000 KB, roughly) 2GB per hour. For
> >> 35 hours it is only 70GB.
> >>
> >> What if the iodepth is 128 or 64, and there is no IOPS rate limitation?
> >>
> >>> Note:
> >>> - The Tab1 fio job will run for 24000 seconds; it is the one that causes
> >>> the fragmentation and makes cache_available_percent drop to under 40.
> >>> - The Tab2 fio job is the one whose latency I'm capturing, and I have
> >>> let it run for about 35 hours, which is long enough to allow
> >>> cache_available_percent to drop under 30.
> >>> - This testing method used a fio benchmark with a larger read block
> >>> size and a small write block size to cause the high fragmentation.
> >>> However, in a real production environment there could be various
> >>> reasons, or a combination of reasons, that cause the high
> >>> fragmentation, but I believe it should be OK to use any method to
> >>> cause the fragmentation in order to verify whether bcache with this
> >>> patch responds better than master in this situation.
> >>>
> >>> Below is the testing result:
> >>>
> >>> The total run time is about 35 hours; there are about 1.5 million
> >>> latency points in the charts for each run.
> >>>
> >>> Master:
> >>> fio-lat-mater.png
> >>>
> >>> Master + patch:
> >>> fio-lat-patch.png
> >>>
> >>> Combined together:
> >>> fio-lat-mix.png
> >>>
> >>> Now we can see that master is even worse when we increase the iodepth,
> >>> which makes sense since the backing HDD is being stressed harder.
> >>>
> >>> Below are the cache stats changing during the run:
> >>>
> >>> Master:
> >>> bcache-stats-master.png
> >>>
> >>> Master + the patch:
> >>> bcache-stats-patch.png
> >>>
> >>> That's all the testing done with the 400GB NVMe with a 512B block size.
> >>>
> >>> Coly, do you want me to continue the same testing on the 1TB NVMe with
> >>> a different block size?
> >>> Or is it OK to skip the 1TB testing and continue the tests with the
> >>> 400GB NVMe but with a different block size?
> >>> Feel free to let me know any other test scenarios that we should cover
> >>> here.
> >>
> >> Yes please, more testing is desired for the performance improvement. So
> >> far I don't see performance numbers for a real high workload yet.
> >>
> >> Thanks.
> >>
> >> Coly Li
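One more note on how the latency numbers can be reduced: the
--write_lat_log=rw option in the Tab2 fio command writes per-I/O latency
samples to a log file such as rw_lat.1.log. The snippet below is only a
minimal sketch for summarizing such a log; it assumes fio's usual
"time, latency, direction, block size, offset" column layout and
nanosecond latency values (recent fio versions log nanoseconds, older
ones microseconds), so the scaling may need adjusting.

  # Average and maximum completion latency from the fio latency log,
  # assuming the second column is the latency in nanoseconds.
  awk -F',' '{ sum += $2; if ($2 > max) max = $2; n++ }
             END { if (n) printf "samples=%d avg=%.1fus max=%.1fus\n",
                               n, sum/n/1000, max/1000 }' rw_lat.1.log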