Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S936132AbcLOMCz (ORCPT ); Thu, 15 Dec 2016 07:02:55 -0500 Received: from mail-dm3nam03on0128.outbound.protection.outlook.com ([104.47.41.128]:26635 "EHLO NAM03-DM3-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932595AbcLOMCw (ORCPT ); Thu, 15 Dec 2016 07:02:52 -0500 From: Dexuan Cui To: Jens Axboe , "Theodore Ts'o" , "Andreas Dilger" , "linux-block@vger.kernel.org" , "linux-ext4@vger.kernel.org" CC: "linux-kernel@vger.kernel.org" , Abel Hu , Thomas Shao , Matthew Wilcox , Long Li , KY Srinivasan Subject: Big I/O requests are split into small ones due to unaligned ext4 partition boundary? Thread-Topic: Big I/O requests are split into small ones due to unaligned ext4 partition boundary? Thread-Index: AdJWrOFbDLU/PpLnQsCA15gterVoiw== Date: Thu, 15 Dec 2016 11:47:24 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: spf=none (sender IP is ) smtp.mailfrom=decui@microsoft.com; x-originating-ip: [2404:f801:9000:19::259] x-ms-office365-filtering-correlation-id: 5a4e778d-633d-41f5-2252-08d424e02362 x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:(22001);SRVR:MWHPR03MB2494; x-microsoft-exchange-diagnostics: 1;MWHPR03MB2494;7:bsqzRaak4vJjsG5u1NE2tovqLGP9BupWFSXy7tdc4Bkh8knggFUrFua2WNrvSH7oKNsWb9I/syXmPzYXTjauLOsJ/IcS7WONVenfG7klfoNIYzBhggLjpZxrSvOGlIN8hIjG4KHvxXM0VhrRxurV0j5MI6knXT1tu58JqwGHY0uHuxKJuCBmZz4UZGUm4eDEM1ti1uFm0JKljpJIXeQSrNZhRg3cBAwAzAiXKyobRpePnRjNWDTpoA37dDdVzc85TOslKwRcCB96kaSOwgd1JfZSVBNZ3kGaFuFugGQZqFmV6r+UbQ0/SicT/QkHYlTHmMlT7sGt8NC9RriNSQF9hDHzz6nUtBjvOKYGbLod+/MGXZvUUD4+kJ8s+G+63vR2NZMRJ58GEJLynLP/Yyk105dpu+ZQA+p84zxklOzQrYStJMXuComzgztDWgFiLe2f+I8FND2L3lb2K3YIbUjFVR+Lwjdu5Kgw8ZHRQ+zvFzM= x-microsoft-antispam-prvs: x-exchange-antispam-report-test: UriScan:; x-exchange-antispam-report-cfa-test: BCL:0;PCL:0;RULEID:(61425038)(6040375)(601004)(2401047)(5005006)(8121501046)(10201501046)(3002001)(6055026)(61426038)(61427038)(6041248)(20161123562025)(20161123564025)(20161123560025)(20161123555025)(6042181)(6072148)(6047074);SRVR:MWHPR03MB2494;BCL:0;PCL:0;RULEID:;SRVR:MWHPR03MB2494; x-forefront-prvs: 0157DEB61B x-forefront-antispam-report: SFV:NSPM;SFS:(10019020)(6009001)(7916002)(199003)(189002)(10090500001)(106356001)(10290500002)(86362001)(5005710100001)(99286002)(33656002)(3280700002)(3660700001)(8990500004)(5660300001)(97736004)(5001770100001)(105586002)(101416001)(86612001)(189998001)(7696004)(107886002)(2906002)(2900100001)(76576001)(6506006)(8936002)(74316002)(4001430100002)(6436002)(54356999)(9686002)(4326007)(50986999)(122556002)(2501003)(92566002)(38730400001)(77096006)(81156014)(102836003)(6116002)(81166006)(8676002)(2201001)(2171001)(25786008)(305945005)(68736007)(7736002);DIR:OUT;SFP:1102;SCL:1;SRVR:MWHPR03MB2494;H:MWHPR03MB2669.namprd03.prod.outlook.com;FPR:;SPF:None;PTR:InfoNoRecords;A:1;MX:1;LANG:en; spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 X-OriginatorOrg: microsoft.com X-MS-Exchange-CrossTenant-originalarrivaltime: 15 Dec 2016 11:47:24.7818 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: 72f988bf-86f1-41af-91ab-2d7cd011db47 X-MS-Exchange-Transport-CrossTenantHeadersStamped: MWHPR03MB2494 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mail.home.local id uBFC30Js020179 Content-Length: 2123 Lines: 54 Hi, when I run "mkfs.ext4 /dev/sdc2" in a Linux virtual machine on Hyper-V, where a disk IOPS=500 limit is applied by me [0], the command takes much more time, if the ext4 partition boundary is not properly aligned: Example 1 [1]: it takes ~7 minutes with average wMB/s = 0.3 (slow) Example 2 [2]: it takes ~3.5 minutes with average wMB/s = 0.6 (slow) Example 3 [3]: it takes ~0.5 minute with average wMB/s = 4 (expected) strace shows the mkfs.ext3 program calls seek()/write() a lot and most of the writes use 32KB buffers (this should be big enough), and the program only invokes fsync() once, after it issues all the writes -- the fsync() takes >99% of the time. By logging SCSI commands, the SCSI Write(10) command is used here for the userspace 32KB write: in example 1, *each* command writes 1 or 2 sectors only (1 sector = 512 bytes); in example 2, *each* command writes 2 or 4 sectors only; in example 3, each command writes 1024 sectors. It looks the kernel block I/O layer can somehow split big user-space buffers into really small write requests (1, 2, and 4 sectors)? This looks really strange to me. Note: in my test, this strange issue happens to 4.4 and the mainline 4.9 kernels, but the stable 3.18.45 kernel doesn't have the issue, i.e. all the 3 above test examples can finish in ~0.5 minute. Any comment? Thanks! -- Dexuan [0] The max IOPS are measured in 8KB increments, meaning the max throughput is 8KB * 500 = 4000KB. [1] This is the partition info of my 20GB disk: # fdisk -l /dev/sdc Disk /dev/sdc: 20 GiB, 21474836480 bytes, 41943040 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disklabel type: dos Disk identifier: 0x00000000 Device Boot Start End Sectors Size Id Type /dev/sdc1 1 14281784 14281784 6.8G 82 Linux swap / Solaris /dev/sdc2 14281785 41929649 27647865 13.2G 83 Linux Here, start_sector = 14281785, end_sector = 41929649. [2] start_sector = 14282752, end_sector = 41929649 [3] start_sector = 14282752, end_sector = 41943039