Received: by 2002:ab2:6a05:0:b0:1f8:1780:a4ed with SMTP id w5csp525654lqo; Fri, 10 May 2024 07:09:42 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCVieQKhSslqeMN8/u9us3VmO66zGimeTP7zD5c5eHW2DFj2oELaHC4DSWLFtj7TSnOtgiSP1Jf9o3ArWzLUKE0oxFwgIr1JdImw/S1ucQ== X-Google-Smtp-Source: AGHT+IGIoRdoyM51n+1GUUzcVS2E9806MqRkcrcDZHIahhHKj+vJCn4ZgSyUrTc0jUrYLiqzNTfh X-Received: by 2002:a17:906:3809:b0:a59:a05c:b553 with SMTP id a640c23a62f3a-a5a2d572471mr291615266b.17.1715350182023; Fri, 10 May 2024 07:09:42 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1715350182; cv=pass; d=google.com; s=arc-20160816; b=XNXKWo5x/Yq0fY3t3SlV25s1dECVbXTYeKmGOqN44OkdCU3B8P9tRqxjv4D0TI+yBi 4VBD9jUJrNCcm2mfwkAF1IsIEYdiFGES+DQz5Qb0OHKLIRGp/MxT0z4dyijSJST3XkVa dSW2D8iTNmXRoSjU38X0yac9ai0sbLJwX19scnBBD1G5uowhTApZXaGSO4oYckYDIVxM ypn+j0sKIJ+SjWpnOAjyAqJ7m+2pVx7gjYUlXiNrvv2iv2x1mGlLtdcBlwtGpdgt7oB9 lfQEk//jm70QRhTl845zTeb8UOJFLDdapYK9deEw6AdNe6yNrUqLx0lW4UgNC1cgm7J/ ZAEg== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:message-id:in-reply-to:subject:cc:to:from:date :dkim-signature; bh=k8FkXX8ag030wY75AQZvmJGfyiDrhLTTaDh51EZd1JY=; fh=bg7P596Gu3my9x1HIInHV/PCvc53gc8ZrS9Bek1uBcs=; b=JpuaXcSuE0eYIRlia2gHlZWfSO6KACvke6RBqWevkONheIp3cEftTJFa54O0WF1aNV WmrF1RAJlqO0uK7F3sYLtxl5UomWW54zgiflLeB2y1WpPDgEQFhyDj8V24+3NEDMJ1ra 5fJIfBlKzfANBBFkDz0GyQFPJUD6FZFimJ/y0LHNI/9rucbt9Mm2nK05OAFgdCM+33fn xRasgKl+kEvXCoeleTGkTn2HrHPWluegDu3P395MG8TpYi0iuToBKABrVZHk/diuSBQr HEeRRZsfTmLq9ngCEV97Iy99mvgfwbDp94i/6uOArG5yNiNlfj8tMcOdUb1eNZhQ4SbT fTnA==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=MASFm11c; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-175866-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-175866-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id a640c23a62f3a-a5a17ba3a5asi212258866b.591.2024.05.10.07.09.41 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 10 May 2024 07:09:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-175866-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=MASFm11c; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-175866-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-175866-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 976681F2417D for ; Fri, 10 May 2024 14:09:41 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 9619F12D1FD; Fri, 10 May 2024 14:08:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="MASFm11c" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C33A712CD89 for ; Fri, 10 May 2024 14:08:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715350090; cv=none; b=EKEJNUNtrdLfXiZx7BvxX2E6HloiJe70TCMu/h7pzqU6YzVR8AP37Vk8ZDNDMISCbr5Z8wFdn9BsdnERJE5VlaYKRW8DcfM6IViavbsb5xphn2ozYJPCMs1LLFOs49T3rqcEVMhYANOlw9wscK4yVm/bSzB4GfEIxZjFSn5cEKU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715350090; c=relaxed/simple; bh=GY8QKF6N/5yxj3g4eHK9Mo7u5V+hozEk12ERqyApuWw=; h=Date:From:To:cc:Subject:In-Reply-To:Message-ID:References: MIME-Version:Content-Type; b=WAxZjM/d81P1M67AjT2N+0wPxo4lWLI30L6riJqoWJ0rwlSjFfubukupJiBivR21wloa0YLUtS3cwJf3Fk6SLT/hAG1oelHRtpXmYA9J4VC0r1Wb1yv92J0mRPyJcoXQDw0KlGqDcxGBKEaU5fjsLqaRVa4qK0OZTFRyT2M6k9A= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=MASFm11c; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715350087; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=k8FkXX8ag030wY75AQZvmJGfyiDrhLTTaDh51EZd1JY=; b=MASFm11cSRjBFCjI1Yl4AlqTO8vy1HF/e0mOcsPrOaN4lNSTHRKl9Xx8ctAs1hKCL1O/kP pjo8lT7AU1IQSarO7mO4LL3jmjOqHVjMSVj5l4yBAousGYFnSYUZ3i6gYbSw419kBkAj0k /TOlz8jMYmSNcAhwP8qzbTYKi+6tXrQ= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-381-cJOXsvMFPxqCMZeoTfqDlg-1; Fri, 10 May 2024 10:08:05 -0400 X-MC-Unique: cJOXsvMFPxqCMZeoTfqDlg-1 Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 35B6C1C02151; Fri, 10 May 2024 14:08:05 +0000 (UTC) Received: from file1-rdu.file-001.prod.rdu2.dc.redhat.com (unknown [10.11.5.21]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 2A58F49103D; Fri, 10 May 2024 14:08:05 +0000 (UTC) Received: by file1-rdu.file-001.prod.rdu2.dc.redhat.com (Postfix, from userid 12668) id 03E1C30C1B93; Fri, 10 May 2024 14:08:04 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by file1-rdu.file-001.prod.rdu2.dc.redhat.com (Postfix) with ESMTP id F1CE43FB52; Fri, 10 May 2024 16:08:04 +0200 (CEST) Date: Fri, 10 May 2024 16:08:04 +0200 (CEST) From: Mikulas Patocka To: Yang Yang cc: Alasdair Kergon , Mike Snitzer , dm-devel@lists.linux.dev, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] dm: Avoid sending redundant empty flush bios to the same block device In-Reply-To: <20240422100540.2213-1-yang.yang@vivo.com> Message-ID: <1f741e4d-c33f-a2e3-f4dd-d7f613443534@redhat.com> References: <20240422100540.2213-1-yang.yang@vivo.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.10 Hi Regarding *bitmap = bitmap_zalloc(t->num_devices + 1, GFP_KERNEL); - you can't allocate memory in the I/O processing path, because it may deadlock. Think of a case when the system is out of memory and it needs to swap out some pages and the swap out operation attempts to allocate more memory. Anyway, the patch is too complicated. I suggest to try this: * introduce a per-target bit "flush_pass_around" that is set for dm-linear and dm-stripe and that means that the target supports flush optimization * set a per-table "flush_pass_around" bit if all the targets in the table have "flush_pass_around" set * in __send_empty_flush, test the table's "flush_pass_around" bit and if it is set, bypass this loop "for (unsigned int i = 0; i < t->num_targets; i++) {" and iterate over dm_table->devices and send the flush to each of them. Hopefully, these changes will make the patch smaller and more acceptable. Mikulas On Mon, 22 Apr 2024, Yang Yang wrote: > __send_empty_flush() sends empty flush bios to every target in the > dm_table. However, if the num_targets exceeds the number of block > devices in the dm_table's device list, it could lead to multiple > invocations of __send_duplicate_bios() for the same block device. > Typically, a single thread sending numerous empty flush bios to one > block device is redundant, as these bios are likely to be merged by the > flush state machine. In scenarios where num_targets significantly > outweighs the number of block devices, such behavior may result in a > noteworthy decrease in performance. > > This issue can be reproduced using this command line: > for i in {0..1023}; do > echo $((8000*$i)) 8000 linear /dev/sda2 $((16384*$i)) > done | dmsetup create example > > With this fix, a random write with fsync workload executed with the > following fio command: > > fio --group_reporting --name=benchmark --filename=/dev/mapper/example \ > --ioengine=sync --invalidate=1 --numjobs=16 --rw=randwrite \ > --blocksize=4k --size=2G --time_based --runtime=30 --fdatasync=1 > > results in an increase from 857 KB/s to 30.8 MB/s of the write > throughput (3580% increase). > > Signed-off-by: Yang Yang > --- > drivers/md/dm-core.h | 1 + > drivers/md/dm-table.c | 7 +++++ > drivers/md/dm.c | 59 +++++++++++++++++++++++++++++++++++ > include/linux/device-mapper.h | 6 ++++ > 4 files changed, 73 insertions(+) > > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h > index e6757a30dcca..7e3f2168289f 100644 > --- a/drivers/md/dm-core.h > +++ b/drivers/md/dm-core.h > @@ -217,6 +217,7 @@ struct dm_table { > /* a list of devices used by this table */ > struct list_head devices; > struct rw_semaphore devices_lock; > + unsigned short num_devices; > > /* events get handed up using this callback */ > void (*event_fn)(void *data); > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > index 41f1d731ae5a..ddc60e498afb 100644 > --- a/drivers/md/dm-table.c > +++ b/drivers/md/dm-table.c > @@ -2133,6 +2133,8 @@ void dm_table_postsuspend_targets(struct dm_table *t) > > int dm_table_resume_targets(struct dm_table *t) > { > + struct list_head *devices = dm_table_get_devices(t); > + struct dm_dev_internal *dd; > unsigned int i; > int r = 0; > > @@ -2159,6 +2161,11 @@ int dm_table_resume_targets(struct dm_table *t) > ti->type->resume(ti); > } > > + t->num_devices = 0; > + > + list_for_each_entry(dd, devices, list) > + dd->dm_dev->index = ++(t->num_devices); > + > return 0; > } > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index 56aa2a8b9d71..7297235291f6 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -48,6 +48,8 @@ > */ > #define REQ_DM_POLL_LIST REQ_DRV > > +#define DM_MAX_TABLE_DEVICES 1024 > + > static const char *_name = DM_NAME; > > static unsigned int major; > @@ -1543,10 +1545,38 @@ static unsigned int __send_duplicate_bios(struct clone_info *ci, struct dm_targe > return ret; > } > > +static inline bool has_redundant_flush(struct dm_table *t, > + unsigned long **bitmap) > +{ > + if (t->num_devices < t->num_targets) { > + /* Add a limit here to prevent excessive memory usage for bitmaps */ > + if (t->num_devices >= DM_MAX_TABLE_DEVICES) > + return false; > + > + /* dm_dev's index starts from 1, so need plus 1 here */ > + *bitmap = bitmap_zalloc(t->num_devices + 1, GFP_KERNEL); > + if (*bitmap) > + return true; > + } > + > + return false; > +} > + > +static int dm_get_dev_index(struct dm_target *ti, struct dm_dev *dev, > + sector_t start, sector_t len, void *data) > +{ > + unsigned short *index = data; > + *index = dev->index; > + return 0; > +} > + > static void __send_empty_flush(struct clone_info *ci) > { > struct dm_table *t = ci->map; > struct bio flush_bio; > + unsigned long *handled_map; > + unsigned int nr_handled = 0; > + bool check = has_redundant_flush(t, &handled_map); > > /* > * Use an on-stack bio for this, it's safe since we don't > @@ -1562,17 +1592,46 @@ static void __send_empty_flush(struct clone_info *ci) > > for (unsigned int i = 0; i < t->num_targets; i++) { > unsigned int bios; > + unsigned short index = 0; > struct dm_target *ti = dm_table_get_target(t, i); > > if (unlikely(ti->num_flush_bios == 0)) > continue; > > + /* > + * If the num_targets is greater than the number of block devices > + * in the dm_table's devices list, __send_empty_flush() might > + * invoke __send_duplicate_bios() multiple times for the same > + * block device. This could lead to a substantial decrease in > + * performance when num_targets significantly exceeds the number > + * of block devices. > + * Ensure that __send_duplicate_bios() is only called once for > + * each block device. > + */ > + if (check) { > + if (nr_handled == t->num_devices) > + break; > + > + if (ti->type->iterate_devices) > + ti->type->iterate_devices(ti, dm_get_dev_index, &index); > + > + if (index > 0) { > + if (__test_and_set_bit(index, handled_map)) > + continue; > + else > + nr_handled++; > + } > + } > + > atomic_add(ti->num_flush_bios, &ci->io->io_count); > bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios, > NULL, GFP_NOWAIT); > atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count); > } > > + if (check) > + bitmap_free(handled_map); > + > /* > * alloc_io() takes one extra reference for submission, so the > * reference won't reach 0 without the following subtraction > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h > index 82b2195efaca..4a54b4f0a609 100644 > --- a/include/linux/device-mapper.h > +++ b/include/linux/device-mapper.h > @@ -169,6 +169,12 @@ struct dm_dev { > struct dax_device *dax_dev; > blk_mode_t mode; > char name[16]; > + > + /* > + * sequential number for each dm_dev in dm_table's devices list, > + * start from 1 > + */ > + unsigned short index; > }; > > /* > -- > 2.34.1 >