Date: Thu, 6 Aug 2015 13:24:21 +1000
From: Dave Chinner
To: Linda Knippers
Cc: Jeff Moyer, "matthew r. wilcox", linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"
Message-ID: <20150806032421.GA16638@dastard>
References: <20150805220113.GC3902@dastard> <55C2BB9E.3040709@hp.com>
In-Reply-To: <55C2BB9E.3040709@hp.com>

On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> > On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> >> Hi, Matthew,
> >>
> >> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
> >>
> >> # mkfs -t xfs -f /dev/pmem0
> >> meta-data=/dev/pmem0             isize=256    agcount=4, agsize=524288 blks
> >>          =                       sectsz=512   attr=2, projid32bit=1
> >>          =                       crc=0        finobt=0
> >> data     =                       bsize=4096   blocks=2097152, imaxpct=25
> >>          =                       sunit=0      swidth=0 blks
> >> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> >> log      =internal log           bsize=4096   blocks=2560, version=2
> >>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> >> realtime =none                   extsz=4096   blocks=0, rtextents=0
> >> mkfs.xfs: read failed: Numerical result out of range
> >>
> >> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> >> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> >> from the last sector of the device. This results in dax_io trying to do
> >> a page-sized I/O at 512 bytes from the end of the device.
> >
> > Right - we have to be able to do IO to that last sector, so this is
> > a sanity check to tell if the block dev is large enough. The XFS
> > kernel code does the same end-of-device sector read when the
> > filesystem is mounted, too.
> >
> >> bdev_direct_access, receiving this bogus pos/size combo, returns
> >> -ERANGE:
> >>
> >> 	if ((sector + DIV_ROUND_UP(size, 512)) >
> >> 	    part_nr_sects_read(bdev->bd_part))
> >> 		return -ERANGE;
> >>
> >> Given that file systems supporting dax refuse to mount with a blocksize
> >> != page size, I'm guessing this is sort of expected behavior. However,
> >> we really shouldn't be breaking direct I/O on pmem devices.
> >
> > If the device is advertising 512 byte sector size support, then this
> > needs to work, especially as DAX is completely transparent on the
> > block device. Remember that DAX through a filesystem works on
> > filesystem data block size boundaries, so a 512 byte sector/4k block
> > size filesystem will be able to use DAX for mmapped files just fine.
> >
> >> So, what do you want to do? We could make the pmem device's logical
> >> block size fixed at the system page size. Or, we could modify the dax
> >> code to work with blocksize < pagesize. Or, we could continue using the
> >> direct I/O codepath for direct block device access. What do you think?
> >
> > I don't know how the pmem device sets up its limits. Can you post
> > the output of:
> >
> > /sys/block/pmem0/queue/logical_block_size
> 512
>
> > /sys/block/pmem0/queue/physical_block_size
> 512
>
> > /sys/block/pmem0/queue/hw_sector_size
> 512
>
> > /sys/block/pmem0/queue/minimum_io_size
> 512
>
> > /sys/block/pmem0/queue/optimal_io_size
> 0

Ok, so the pmem device is advertising 512 bytes for both physical
and logical sector sizes. That means mkfs.xfs is not doing anything
wrong. i.e. ERANGE on a read of the last sector of the block device
is a bug in the block device code.

It is not at all obvious from these sector sizes that the block
device is DAX enabled. I'd suggest that you probably want to make
the physical sector size 4k on x86-64 to indicate to filesystem
utilities that 4k alignment of the filesystem is preferred, even if
512 byte IO can be supported in a less efficient manner (i.e.
equivalent of a 512e hard drive)....

You can't really make the logical sector size = PAGE_SIZE, because
on 64k page size machines that will make the sector size larger than
many filesystems support. e.g. XFS only supports sector sizes up to
32k at the moment...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
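
For anyone following the arithmetic: below is a minimal userspace
sketch of the range check quoted above, showing why a page-sized I/O
starting at the last 512 byte sector trips -ERANGE while a 512 byte
read at the same position does not. The names model_range_check()
and last_sector_read_status() are hypothetical helpers made up for
illustration, and nr_sects simply stands in for
part_nr_sects_read(bdev->bd_part). This is not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SECTOR_SIZE 512u
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Model of the quoted check: reject an I/O of `size` bytes starting at
 * `sector` if it would run past the end of a device of `nr_sects`
 * 512 byte sectors. Hypothetical helper, not the kernel function.
 */
int model_range_check(uint64_t sector, uint64_t size, uint64_t nr_sects)
{
	if (sector + DIV_ROUND_UP(size, SECTOR_SIZE) > nr_sects)
		return -ERANGE;
	return 0;
}

/*
 * Status of an I/O of `size` bytes that starts at the last sector of
 * the device, which is where the mkfs.xfs end-of-device read lands.
 */
int last_sector_read_status(uint64_t nr_sects, uint64_t size)
{
	assert(nr_sects > 0);
	return model_range_check(nr_sects - 1, size, nr_sects);
}
```

Assuming the device is exactly as large as the filesystem in the
mkfs output (2097152 blocks of 4096 bytes, i.e. 16777216 sectors),
last_sector_read_status(16777216, 512) passes the check, whereas
last_sector_read_status(16777216, 4096) returns -ERANGE because the
I/O has been rounded up to eight sectors.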