Date: Tue, 16 Jun 2015 22:14:30 +0800
From: juncheng bai
To: Ilya Dryomov
Cc: idryomov@redhat.com, Alex Elder, Josh Durgin, Guangliang Zhao,
    jeff@garzik.org, yehuda@hq.newdream.net, Sage Weil, elder@inktank.com,
    "linux-kernel@vger.kernel.org", Ceph Development
Subject: Re: [PATCH RFC] storage: rbd: make the request size equal to the object size
Message-ID: <55802F46.7020804@unitedstack.com>

On 2015/6/16 21:30, Ilya Dryomov wrote:
> On Tue, Jun 16, 2015 at 2:57 PM, juncheng bai wrote:
>> On 2015/6/16 16:37, Ilya Dryomov wrote:
>>> On Tue, Jun 16, 2015 at 6:28 AM, juncheng bai wrote:
>>>> On 2015/6/15 22:27, Ilya Dryomov wrote:
>>>>> On Mon, Jun 15, 2015 at 4:23 PM, juncheng bai wrote:
>>>>>> On 2015/6/15 21:03, Ilya Dryomov wrote:
>>>>>>> On Mon, Jun 15, 2015 at 2:18 PM, juncheng bai wrote:
>>>>>>>>
>>>>>>>> From 6213215bd19926d1063d4e01a248107dab8a899b Mon Sep 17 00:00:00 2001
>>>>>>>> From: juncheng bai
>>>>>>>> Date: Mon, 15 Jun 2015 18:34:00 +0800
>>>>>>>> Subject: [PATCH] storage: rbd: make the request size equal to the
>>>>>>>>  object size
>>>>>>>>
>>>>>>>> Ensure that the merged request size can reach the object size.
>>>>>>>> When a bio is merged into a request, or one request is merged into
>>>>>>>> another, the sum of the segment count of the current request and
>>>>>>>> the segment count of the bio must not exceed the queue's max
>>>>>>>> segments limit, so the maximum request size is 512k if the max
>>>>>>>> segments limit is BLK_MAX_SEGMENTS.
>>>>>>>>
>>>>>>>> Signed-off-by: juncheng bai
>>>>>>>> ---
>>>>>>>>  drivers/block/rbd.c | 2 ++
>>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>>>> index 0a54c58..dec6045 100644
>>>>>>>> --- a/drivers/block/rbd.c
>>>>>>>> +++ b/drivers/block/rbd.c
>>>>>>>> @@ -3757,6 +3757,8 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>>>>>>>>         segment_size = rbd_obj_bytes(&rbd_dev->header);
>>>>>>>>         blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
>>>>>>>>         blk_queue_max_segment_size(q, segment_size);
>>>>>>>> +       if (segment_size > BLK_MAX_SEGMENTS * PAGE_SIZE)
>>>>>>>> +               blk_queue_max_segments(q, segment_size / PAGE_SIZE);
>>>>>>>>         blk_queue_io_min(q, segment_size);
>>>>>>>>         blk_queue_io_opt(q, segment_size);
>>>>>>>
>>>>>>> I made a similar patch on Friday, investigating the blk-mq plugging
>>>>>>> issue reported by Nick.  My patch sets it to BIO_MAX_PAGES
>>>>>>> unconditionally - AFAIU there is no point in setting it to anything
>>>>>>> bigger, since the bios will be clipped to that number of vecs.
>>>>>>> Given that BIO_MAX_PAGES is 256, this gives us 1M direct I/Os.
>>>>>>
>>>>>> Hi.  For a single bio, the max number of bio_vecs is BIO_MAX_PAGES,
>>>>>> but a request can be merged from multiple bios; see functions such
>>>>>> as ll_back_merge_fn and ll_front_merge_fn.
>>>>>> I tested this patch on kernel 3.18 and did:
>>>>>> echo 4096 > /sys/block/rbd0/queue/max_sectors_kb
>>>>>> Using systemtap to trace the request size, it goes up to 4M.
>>>>>
>>>>> Kernel 3.18 predates the rbd blk-mq transition, which happened in
>>>>> 4.0.  You should test whatever patches you have with at least 4.0.
>>>>>
>>>>> Putting that aside, I must be missing something.  You'll get 4M
>>>>> requests on 3.18 both with your patch and without it; the only
>>>>> difference would be the size of the bios being merged - 512k vs 1M.
>>>>> Can you describe your test workload and provide before and after
>>>>> traces?
>>>>
>>>> Hi.  I updated the kernel to 4.0.5.  The test information is shown
>>>> below.
>>>>
>>>> The base information:
>>>> 03:28:13-root@server-186:~$ uname -r
>>>> 4.0.5
>>>>
>>>> My simple systemtap script:
>>>> probe module("rbd").function("rbd_img_request_create")
>>>> {
>>>>     printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
>>>> }
>>>>
>>>> I use dd to execute the test case:
>>>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=1 oflag=direct
>>>>
>>>> Case one: without the patch
>>>> 03:30:23-root@server-186:~$ cat /sys/block/rbd0/queue/max_sectors_kb
>>>> 4096
>>>> 03:30:35-root@server-186:~$ cat /sys/block/rbd0/queue/max_segments
>>>> 128
>>>>
>>>> The systemtap output for normal data:
>>>> offset:0 length:524288
>>>> offset:524288 length:524288
>>>> offset:1048576 length:524288
>>>> offset:1572864 length:524288
>>>> offset:2097152 length:524288
>>>> offset:2621440 length:524288
>>>> offset:3145728 length:524288
>>>> offset:3670016 length:524288
>>>>
>>>> Case two: with the patch
>>>> cat /sys/block/rbd0/queue/max_sectors_kb
>>>> 4096
>>>> 03:49:14-root@server-186:linux-4.0.5$ cat /sys/block/rbd0/queue/max_segments
>>>> 1024
>>>>
>>>> The systemtap output for normal data:
>>>> offset:0 length:1048576
>>>> offset:1048576 length:1048576
>>>> offset:2097152 length:1048576
>>>> offset:3145728 length:1048576
>>>>
>>>> According to the test, you are right, because blk-mq doesn't use any
>>>> scheduling policy:
>>>> 03:52:13-root@server-186:linux-4.0.5$ cat /sys/block/rbd0/queue/scheduler
>>>> none
>>>>
>>>> Before kernel 4.0, rbd used the default scheduler, cfq.
>>>>
>>>> So I think blk-mq needs to do more?
>>>
>>> There is no scheduler support in blk-mq as of now, but your numbers
>>> don't have anything to do with that.  The current behaviour is the
>>> result of a bug in blk-mq.  It's fixed by [1]; if you apply it you
>>> should see 4M requests with your stap script.
>>>
>>> [1] http://article.gmane.org/gmane.linux.kernel/1941750
>>
>> Hi.
>> First, let's look at the result on kernel 3.18.
>> The function blk_limits_max_hw_sectors is implemented differently in
>> 3.18 and 4.0+, so we need to do:
>> echo 4094 > /sys/block/rbd0/queue/max_sectors_kb
>>
>> The rbd device information:
>> 11:13:18-root@server-186:~$ cat /sys/block/rbd0/queue/max_sectors_kb
>> 4094
>> 11:15:28-root@server-186:~$ cat /sys/block/rbd0/queue/max_segments
>> 1024
>>
>> The test command:
>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=1
>>
>> The simple stap script:
>> probe module("rbd").function("rbd_img_request_create")
>> {
>>     printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
>> }
>>
>> The output from stap:
>> offset:0 length:4190208
>> offset:21474770944 length:4096
>>
>> Second, thanks for your patch [1].
>> I applied patch [1] and recompiled the kernel.
>> The test information is shown below:
>> 12:26:12-root@server-186:$ cat /sys/block/rbd0/queue/max_segments
>> 1024
>> 12:26:23-root@server-186:$ cat /sys/block/rbd0/queue/max_sectors_kb
>> 4096
>>
>> The test command:
>> dd if=/dev/zero of=/dev/rbd0 bs=4M count=2 oflag=direct
>>
>> The simple systemtap script:
>> probe module("rbd").function("rbd_img_request_create")
>> {
>>     printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
>> }
>>
>> The systemtap output for normal data:
>> offset:0 length:4194304
>> offset:4194304 length:4194304
>> offset:21474770944 length:4096
>
> Sorry, I fail to see the purpose of the above tests.  The test commands
> differ, the kernels differ, and it looks like you had your patch applied
> for both tests.  What I'm trying to get you to do is to show me some
> data that will back the claim your patch is based on:
>
>> So, I think that max_segments in the queue limits should be the object
>> size divided by PAGE_SIZE.
>
> For that you need to use the same kernel and run the same workload.
> The only difference should be whether your patch is applied or not.
> I still think that setting rbd max_segments to anything above
> BIO_MAX_PAGES is bogus, but I'd be happy to be shown wrong on that,
> since that would mean better performance, at least in some workloads.

Hi.  For a cloned image, copyup can be avoided when the request size
equals the object size; I think that is the key effect of this patch.
On the other hand, a large request could time out if the Ceph backend
is busy or the network bandwidth is too low, so I suggest adding a
module parameter that lets the user control the value (a rough sketch
is appended below my signature).

Thanks.
----
juncheng bai

> Thanks,
>
>                 Ilya
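P.S.  To make the module parameter suggestion a bit more concrete, here is
a rough, untested sketch against drivers/block/rbd.c.  The parameter name
rbd_max_seg_pages, the helper function, and its default are my own
illustration, not part of the patch in this thread; SECTOR_SIZE is the
local define that rbd.c already uses in the hunk above.

#include <linux/module.h>
#include <linux/blkdev.h>

/*
 * Hypothetical knob: cap on the number of segments per request, in pages.
 * 0 (the default) keeps the behaviour of the patch above and derives the
 * limit from the object size; a non-zero value lets the admin bound the
 * request size on a busy cluster or a slow network.
 */
static unsigned int rbd_max_seg_pages;
module_param(rbd_max_seg_pages, uint, 0444);
MODULE_PARM_DESC(rbd_max_seg_pages,
                 "max segments per request, in pages (0 = object size / PAGE_SIZE)");

/* Sketch of how rbd_init_disk() could apply the knob to the queue limits. */
static void rbd_set_segment_limits(struct request_queue *q, u64 segment_size)
{
        unsigned int max_segs = segment_size / PAGE_SIZE;

        if (rbd_max_seg_pages)
                max_segs = min_t(unsigned int, max_segs, rbd_max_seg_pages);

        blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
        blk_queue_max_segment_size(q, segment_size);
        if (max_segs > BLK_MAX_SEGMENTS)
                blk_queue_max_segments(q, max_segs);
}

Whether max_segs should instead be clamped to BIO_MAX_PAGES is exactly the
open question in this thread; the sketch only shows where such a knob could
hook in.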