From: "Michael Kelley (EOSG)"
To: Long Li, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, "Martin K. Petersen",
	"devel@linuxdriverproject.org", "linux-scsi@vger.kernel.org",
	"linux-kernel@vger.kernel.org"
Subject: RE: [Patch v2] Storvsc: Select channel based on available percentage of ring buffer to write
Date: Sun, 22 Apr 2018 18:53:50 +0000
References: <20180419215424.3557-1-longli@linuxonhyperv.com>
In-Reply-To: <20180419215424.3557-1-longli@linuxonhyperv.com>

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org On Behalf Of Long Li
> Sent: Thursday, April 19, 2018 2:54 PM
> To: KY Srinivasan; Haiyang Zhang; Stephen Hemminger; James E. J. Bottomley;
> Martin K. Petersen; devel@linuxdriverproject.org; linux-scsi@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: Long Li
> Subject: [Patch v2] Storvsc: Select channel based on available percentage of
> ring buffer to write
>
> From: Long Li
>
> This is a best-effort estimate of how busy the ring buffer is for that
> channel, based on the percentage of the ring buffer that is available to
> write. It is still possible that at the time of the actual ring buffer
> write the space is no longer available, because other processes may be
> writing at the same time.
>
> Selecting a channel based on how full it is reduces the chance that a
> ring buffer write will fail, and avoids overloading a busy channel.
>
> This also makes it possible for storvsc to use a smaller ring buffer size
> (e.g. 40k bytes) to take advantage of cache locality.
>
> Changes:
> v2: Pre-allocate struct cpumask on the heap.
> struct cpumask is a big structure (1k bytes) when CONFIG_NR_CPUS=8192
> (the default value when CONFIG_MAXSMP=y). Don't use the kernel stack for
> it; instead, pre-allocate the masks with kmalloc when channels are first
> initialized.
>
> Signed-off-by: Long Li
> ---
>  drivers/scsi/storvsc_drv.c | 90 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 72 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index a2ec0bc9e9fa..2a9fff94dd1a 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
>
>  module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
>  MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
> +
> +static int ring_avail_percent_lowater = 10;
> +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> +MODULE_PARM_DESC(ring_avail_percent_lowater,
> +	"Select a channel if available ring size > this in percent");
> +
>  /*
>   * Timeout in seconds for all devices managed by this driver.
>   */
> @@ -468,6 +474,13 @@ struct storvsc_device {
> 	 * Mask of CPUs bound to subchannels.
> 	 */
> 	struct cpumask alloced_cpus;
> +	/*
> +	 * Pre-allocated struct cpumask for each hardware queue.
> +	 * struct cpumask is used by selecting out-going channels. It is a
> +	 * big structure, default to 1024k bytes when CONFIG_MAXSMP=y.

I think you mean "1024 bytes" or "1k bytes" in the above comment.

> +	 * Pre-allocate it to avoid allocation on the kernel stack.
> +	 */
> +	struct cpumask *cpumask_chns;
> 	/* Used for vsc/vsp channel reset process */
> 	struct storvsc_cmd_request init_request;
> 	struct storvsc_cmd_request reset_request;
> @@ -872,6 +885,13 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc)
> 	if (stor_device->stor_chns == NULL)
> 		return -ENOMEM;
>
> +	stor_device->cpumask_chns = kcalloc(num_possible_cpus(),
> +			sizeof(struct cpumask), GFP_KERNEL);

Note that num_possible_cpus() is 240 for a Hyper-V 2016 guest unless
overridden on the kernel boot line, so this is going to allocate 240 Kbytes
for each synthetic SCSI controller.
On an Azure VM, which has two IDE and two SCSI controllers, this is nearly
1 Mbyte. It's unfortunate to have to allocate this much memory for what is
essentially a temporary variable. Further down in these comments, I've
proposed an alternate implementation of the code that avoids the need for
the temporary variable, and hence avoids the need for this allocation.

> +	if (stor_device->cpumask_chns == NULL) {
> +		kfree(stor_device->stor_chns);
> +		return -ENOMEM;
> +	}
> +
> 	stor_device->stor_chns[device->channel->target_cpu] = device->channel;
> 	cpumask_set_cpu(device->channel->target_cpu,
> 			&stor_device->alloced_cpus);
> @@ -1232,6 +1252,7 @@ static int storvsc_dev_remove(struct hv_device *device)
> 	vmbus_close(device->channel);
>
> 	kfree(stor_device->stor_chns);
> +	kfree(stor_device->cpumask_chns);
> 	kfree(stor_device);
> 	return 0;
> }
> @@ -1241,7 +1262,7 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	u16 slot = 0;
> 	u16 hash_qnum;
> -	struct cpumask alloced_mask;
> +	struct cpumask *alloced_mask = &stor_device->cpumask_chns[q_num];
> 	int num_channels, tgt_cpu;
>
> 	if (stor_device->num_sc == 0)
> @@ -1257,10 +1278,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	 * III. Mapping is persistent.
> 	 */
>
> -	cpumask_and(&alloced_mask, &stor_device->alloced_cpus,
> +	cpumask_and(alloced_mask, &stor_device->alloced_cpus,
> 		    cpumask_of_node(cpu_to_node(q_num)));
>
> -	num_channels = cpumask_weight(&alloced_mask);
> +	num_channels = cpumask_weight(alloced_mask);
> 	if (num_channels == 0)
> 		return stor_device->device->channel;
>
> @@ -1268,7 +1289,7 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> 	while (hash_qnum >= num_channels)
> 		hash_qnum -= num_channels;
>
> -	for_each_cpu(tgt_cpu, &alloced_mask) {
> +	for_each_cpu(tgt_cpu, alloced_mask) {
> 		if (slot == hash_qnum)
> 			break;
> 		slot++;

Here's an alternate implementation of the core code in get_og_chn() that
avoids the need for a struct cpumask as a temporary variable. It checks the
node cpumask on-the-fly, rather than precomputing the logical AND. This
implementation might be slightly slower from a CPU standpoint, but the
alloced_cpus mask is sparse, and get_og_chn() is only called a few times at
the beginning of the usage of the synthetic SCSI controller, so it has no
material impact.

	const struct cpumask *node_mask;

	node_mask = cpumask_of_node(cpu_to_node(q_num));

	num_channels = 0;
	for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
		if (cpumask_test_cpu(tgt_cpu, node_mask))
			num_channels++;
	}
	if (num_channels == 0)
		return stor_device->device->channel;

	hash_qnum = q_num;
	while (hash_qnum >= num_channels)
		hash_qnum -= num_channels;

	for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
		if (!cpumask_test_cpu(tgt_cpu, node_mask))
			continue;
		if (slot == hash_qnum)
			break;
		slot++;
	}

The same approach of checking the node cpumask on-the-fly instead of
pre-computing can also be used in storvsc_do_io(). The perf impact in
storvsc_do_io() is probably nil. Then with both uses of the temp cpumask
eliminated, the large memory allocation for it could be removed.
> @@ -1285,9 +1306,9 @@ static int storvsc_do_io(struct hv_device *device,
> {
> 	struct storvsc_device *stor_device;
> 	struct vstor_packet *vstor_packet;
> -	struct vmbus_channel *outgoing_channel;
> +	struct vmbus_channel *outgoing_channel, *channel;
> 	int ret = 0;
> -	struct cpumask alloced_mask;
> +	struct cpumask *alloced_mask;
> 	int tgt_cpu;
>
> 	vstor_packet = &request->vstor_packet;
> @@ -1301,22 +1322,53 @@ static int storvsc_do_io(struct hv_device *device,
> 	/*
> 	 * Select an appropriate channel to send the request out.
> 	 */
> -
> 	if (stor_device->stor_chns[q_num] != NULL) {
> 		outgoing_channel = stor_device->stor_chns[q_num];
> -		if (outgoing_channel->target_cpu == smp_processor_id()) {
> +		if (outgoing_channel->target_cpu == q_num) {
> 			/*
> 			 * Ideally, we want to pick a different channel if
> 			 * available on the same NUMA node.
> 			 */
> -			cpumask_and(&alloced_mask, &stor_device->alloced_cpus,
> +			alloced_mask = &stor_device->cpumask_chns[q_num];
> +			cpumask_and(alloced_mask, &stor_device->alloced_cpus,
> 				    cpumask_of_node(cpu_to_node(q_num)));
> -			for_each_cpu_wrap(tgt_cpu, &alloced_mask,
> -					outgoing_channel->target_cpu + 1) {
> -				if (tgt_cpu != outgoing_channel->target_cpu) {
> -					outgoing_channel =
> -					stor_device->stor_chns[tgt_cpu];
> -					break;
> +
> +			for_each_cpu_wrap(tgt_cpu, alloced_mask, q_num + 1) {
> +				if (tgt_cpu == q_num)
> +					continue;
> +				channel = stor_device->stor_chns[tgt_cpu];
> +				if (hv_get_avail_to_write_percent(
> +							&channel->outbound)
> +						> ring_avail_percent_lowater) {
> +					outgoing_channel = channel;
> +					goto found_channel;
> +				}
> +			}
> +
> +			/*
> +			 * All the other channels on the same NUMA node are
> +			 * busy. Try to use the channel on the current CPU
> +			 */
> +			if (hv_get_avail_to_write_percent(
> +						&outgoing_channel->outbound)
> +					> ring_avail_percent_lowater)
> +				goto found_channel;
> +
> +			/*
> +			 * If we reach here, all the channels on the current
> +			 * NUMA node are busy. Try to find a channel in
> +			 * other NUMA nodes
> +			 */
> +			cpumask_andnot(alloced_mask, &stor_device->alloced_cpus,
> +				       cpumask_of_node(cpu_to_node(q_num)));
> +
> +			for_each_cpu(tgt_cpu, alloced_mask) {
> +				channel = stor_device->stor_chns[tgt_cpu];
> +				if (hv_get_avail_to_write_percent(
> +							&channel->outbound)
> +						> ring_avail_percent_lowater) {
> +					outgoing_channel = channel;
> +					goto found_channel;
> 				}
> 			}
> 		}
> @@ -1324,7 +1376,7 @@ static int storvsc_do_io(struct hv_device *device,
> 		outgoing_channel = get_og_chn(stor_device, q_num);
> 	}
>
> -
> +found_channel:
> 	vstor_packet->flags |= REQUEST_COMPLETION_FLAG;
>
> 	vstor_packet->vm_srb.length = (sizeof(struct vmscsi_request) -
> @@ -1732,8 +1784,9 @@ static int storvsc_probe(struct hv_device *device,
> 			(num_cpus - 1) / storvsc_vcpus_per_sub_channel;
> 	}
>
> -	scsi_driver.can_queue = (max_outstanding_req_per_channel *
> -				(max_sub_channels + 1));
> +	scsi_driver.can_queue = max_outstanding_req_per_channel *
> +				(max_sub_channels + 1) *
> +				(100 - ring_avail_percent_lowater) / 100;
>
> 	host = scsi_host_alloc(&scsi_driver,
> 			       sizeof(struct hv_host_device));
> @@ -1864,6 +1917,7 @@ static int storvsc_probe(struct hv_device *device,
>
> err_out1:
> 	kfree(stor_device->stor_chns);
> +	kfree(stor_device->cpumask_chns);
> 	kfree(stor_device);

I don't think the memory freeing in err_out1 is correct. You end up in
err_out1 when storvsc_connect_to_vsp() fails. But it could fail in
multiple scenarios, some of which leave cpumask_chns allocated, and some
of which don't. As a result you could free memory that isn't allocated,
or that has already been freed. In fact, the same problem already exists
for stor_chns. I'm not sure about stor_device itself.

>
> err_out0:
> --
> 2.14.1