From: "Michael Kelley (EOSG)"
To: Long Li, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, "Martin K. Petersen",
	"devel@linuxdriverproject.org", "linux-scsi@vger.kernel.org",
	"linux-kernel@vger.kernel.org"
Subject: RE: [Patch v2] Storvsc: Select channel based on available percentage of ring buffer to write
Date: Sun, 22 Apr 2018 18:53:50 +0000
References: <20180419215424.3557-1-longli@linuxonhyperv.com>
In-Reply-To: <20180419215424.3557-1-longli@linuxonhyperv.com>

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org On Behalf Of Long Li
> Sent: Thursday, April 19, 2018 2:54 PM
> To: KY Srinivasan; Haiyang Zhang; Stephen Hemminger; James E. J. Bottomley;
> Martin K. Petersen; devel@linuxdriverproject.org; linux-scsi@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: Long Li
> Subject: [Patch v2] Storvsc: Select channel based on available percentage of
> ring buffer to write
>
> From: Long Li
>
> This is a best-effort estimate of how busy the ring buffer is for that
> channel, based on the percentage of the ring buffer that is available to
> write. It is still possible that at the time of the actual ring buffer
> write the space is no longer available, because other processes may be
> writing at the same time.
>
> Selecting a channel based on how full it is reduces the chance that a
> ring buffer write will fail, and avoids overloading a busy channel.
>
> This also makes it possible for storvsc to use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
>
> Changes:
> v2: Pre-allocate struct cpumask on the heap.
> struct cpumask is a big structure (1k bytes) when CONFIG_NR_CPUS=8192
> (the default value when CONFIG_MAXSMP=y). Don't use the kernel stack for
> it; instead, pre-allocate the masks with kmalloc when channels are first
> initialized.
>
> Signed-off-by: Long Li
> ---
>  drivers/scsi/storvsc_drv.c | 90 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 72 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..2a9fff94dd1a 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
>
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
> +
> +static int ring_avail_percent_lowater = 10;
> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> +	"Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -468,6 +474,13 @@ struct storvsc_device {
> 	 * Mask of CPUs bound to subchannels.
> 	 */
> 	struct cpumask alloced_cpus;
> +	/*
> +	 * Pre-allocated struct cpumask for each hardware queue.
> +	 * struct cpumask is used by selecting out-going channels. It is a
> +	 * big structure, default to 1024k bytes when CONFIG_MAXSMP=y.

I think you mean "1024 bytes" or "1k bytes" in the above comment.

> +	 * Pre-allocate it to avoid allocation on the kernel stack.
> +	 */
> +	struct cpumask *cpumask_chns;
> 	/* Used for vsc/vsp channel reset process */
> 	struct storvsc_cmd_request init_request;
> 	struct storvsc_cmd_request reset_request;
> @@ -872,6 +885,13 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc)
> 	if (stor_device->stor_chns == NULL)
> 		return -ENOMEM;
>
> +	stor_device->cpumask_chns = kcalloc(num_possible_cpus(),
> +			sizeof(struct cpumask), GFP_KERNEL);

Note that num_possible_cpus() is 240 for a Hyper-V 2016 guest unless
overridden on the kernel boot line, so this is going to allocate 240 Kbytes
for each synthetic SCSI controller.
On an Azure VM, which has two IDE and two SCSI controllers, this is nearly
1 Mbyte. It's unfortunate to have to allocate this much memory for what is
essentially a temporary variable. Further down in these comments, I've
proposed an alternate implementation of the code that avoids the need for
the temporary variable, and hence avoids the need for this allocation.

> +	if (stor_device->cpumask_chns == NULL) {
> +		kfree(stor_device->stor_chns);
> +		return -ENOMEM;
> +	}
> +
> 	stor_device->stor_chns[device->channel->target_cpu] = device->channel;
> 	cpumask_set_cpu(device->channel->target_cpu,
> 			&stor_device->alloced_cpus);
> @@ -1232,6 +1252,7 @@ static int storvsc_dev_remove(struct hv_device *device)
> 	vmbus_close(device->channel);
>
> 	kfree(stor_device->stor_chns);
> +	kfree(stor_device->cpumask_chns);
> 	kfree(stor_device);
> 	return 0;
> }
> @@ -1241,7 +1262,7 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	u16 slot = 0;
> 	u16 hash_qnum;
> -	struct cpumask alloced_mask;
> +	struct cpumask *alloced_mask = &stor_device->cpumask_chns[q_num];
> 	int num_channels, tgt_cpu;
>
> 	if (stor_device->num_sc == 0)
> @@ -1257,10 +1278,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	 * III. Mapping is persistent.
> 	 */
>
> -	cpumask_and(&alloced_mask, &stor_device->alloced_cpus,
> +	cpumask_and(alloced_mask, &stor_device->alloced_cpus,
> 		    cpumask_of_node(cpu_to_node(q_num)));
>
> -	num_channels = cpumask_weight(&alloced_mask);
> +	num_channels = cpumask_weight(alloced_mask);
> 	if (num_channels == 0)
> 		return stor_device->device->channel;
>
> @@ -1268,7 +1289,7 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	while (hash_qnum >= num_channels)
> 		hash_qnum -= num_channels;
>
> -	for_each_cpu(tgt_cpu, &alloced_mask) {
> +	for_each_cpu(tgt_cpu, alloced_mask) {
> 		if (slot == hash_qnum)
> 			break;
> 		slot++;

Here's an alternate implementation of the core code in get_og_chn() that
avoids the need for a struct cpumask as a temporary variable. It checks the
node cpumask on-the-fly, rather than precomputing the logical AND. This
implementation might be slightly slower from a CPU standpoint, but the
alloced_cpus mask is sparse, and get_og_chn() is only called a few times at
the beginning of the usage of the synthetic SCSI controller, so it has no
material impact.

	const struct cpumask *node_mask;

	node_mask = cpumask_of_node(cpu_to_node(q_num));

	num_channels = 0;
	for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
		if (cpumask_test_cpu(tgt_cpu, node_mask))
			num_channels++;
	}
	if (num_channels == 0)
		return stor_device->device->channel;

	hash_qnum = q_num;
	while (hash_qnum >= num_channels)
		hash_qnum -= num_channels;

	for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
		if (!cpumask_test_cpu(tgt_cpu, node_mask))
			continue;
		if (slot == hash_qnum)
			break;
		slot++;
	}

The same approach of checking the node cpumask on-the-fly instead of
pre-computing can also be used in storvsc_do_io(). The perf impact in
storvsc_do_io() is probably nil. Then with both uses of the temp cpumask
eliminated, the large memory allocation for it could be removed.
> @@ -1285,9 +1306,9 @@ static int storvsc_do_io(struct hv_device *device,
> {
> 	struct storvsc_device *stor_device;
> 	struct vstor_packet *vstor_packet;
> -	struct vmbus_channel *outgoing_channel;
> +	struct vmbus_channel *outgoing_channel, *channel;
> 	int ret = 0;
> -	struct cpumask alloced_mask;
> +	struct cpumask *alloced_mask;
> 	int tgt_cpu;
>
> 	vstor_packet = &request->vstor_packet;
> @@ -1301,22 +1322,53 @@ static int storvsc_do_io(struct hv_device *device,
> 	/*
> 	 * Select an appropriate channel to send the request out.
> 	 */
> -
> 	if (stor_device->stor_chns[q_num] != NULL) {
> 		outgoing_channel = stor_device->stor_chns[q_num];
> -		if (outgoing_channel->target_cpu == smp_processor_id()) {
> +		if (outgoing_channel->target_cpu == q_num) {
> 			/*
> 			 * Ideally, we want to pick a different channel if
> 			 * available on the same NUMA node.
> 			 */
> -			cpumask_and(&alloced_mask, &stor_device->alloced_cpus,
> +			alloced_mask = &stor_device->cpumask_chns[q_num];
> +			cpumask_and(alloced_mask, &stor_device->alloced_cpus,
> 				    cpumask_of_node(cpu_to_node(q_num)));
> -			for_each_cpu_wrap(tgt_cpu, &alloced_mask,
> -					outgoing_channel->target_cpu + 1) {
> -				if (tgt_cpu != outgoing_channel->target_cpu) {
> -					outgoing_channel =
> -					stor_device->stor_chns[tgt_cpu];
> -					break;
> +
> +			for_each_cpu_wrap(tgt_cpu, alloced_mask, q_num + 1) {
> +				if (tgt_cpu == q_num)
> +					continue;
> +				channel = stor_device->stor_chns[tgt_cpu];
> +				if (hv_get_avail_to_write_percent(
> +							&channel->outbound)
> +						> ring_avail_percent_lowater) {
> +					outgoing_channel = channel;
> +					goto found_channel;
> +				}
> +			}
> +
> +			/*
> +			 * All the other channels on the same NUMA node are
> +			 * busy. Try to use the channel on the current CPU
> +			 */
> +			if (hv_get_avail_to_write_percent(
> +						&outgoing_channel->outbound)
> +					> ring_avail_percent_lowater)
> +				goto found_channel;
> +
> +			/*
> +			 * If we reach here, all the channels on the current
> +			 * NUMA node are busy. Try to find a channel in
> +			 * other NUMA nodes
> +			 */
> +			cpumask_andnot(alloced_mask, &stor_device->alloced_cpus,
> +				       cpumask_of_node(cpu_to_node(q_num)));
> +
> +			for_each_cpu(tgt_cpu, alloced_mask) {
> +				channel = stor_device->stor_chns[tgt_cpu];
> +				if (hv_get_avail_to_write_percent(
> +							&channel->outbound)
> +						> ring_avail_percent_lowater) {
> +					outgoing_channel = channel;
> +					goto found_channel;
> 				}
> 			}
> 		}
> @@ -1324,7 +1376,7 @@ static int storvsc_do_io(struct hv_device *device,
> 		outgoing_channel = get_og_chn(stor_device, q_num);
> 	}
>
> -
> +found_channel:
> 	vstor_packet->flags |= REQUEST_COMPLETION_FLAG;
>
> 	vstor_packet->vm_srb.length = (sizeof(struct vmscsi_request) -
> @@ -1732,8 +1784,9 @@ static int storvsc_probe(struct hv_device *device,
> 			(num_cpus - 1) / storvsc_vcpus_per_sub_channel;
> 	}
>
> -	scsi_driver.can_queue = (max_outstanding_req_per_channel *
> -				(max_sub_channels + 1));
> +	scsi_driver.can_queue = max_outstanding_req_per_channel *
> +				(max_sub_channels + 1) *
> +				(100 - ring_avail_percent_lowater) / 100;
>
> 	host = scsi_host_alloc(&scsi_driver,
> 			       sizeof(struct hv_host_device));
> @@ -1864,6 +1917,7 @@ static int storvsc_probe(struct hv_device *device,
>
> err_out1:
> 	kfree(stor_device->stor_chns);
> +	kfree(stor_device->cpumask_chns);
> 	kfree(stor_device);

I don't think the memory freeing in err_out1 is correct. You end up in
err_out1 when storvsc_connect_to_vsp() fails. But it could fail in
multiple scenarios, some of which leave cpumask_chns allocated, and some
of which don't. As a result you could free memory that isn't allocated,
or that has already been freed. In fact, the same problem already exists
for stor_chns. I'm not sure about stor_device itself.

>
> err_out0:
> --
> 2.14.1