From: Neil Brown
To: Mike Snitzer, "Martin K. Petersen", Linus Torvalds, Alasdair G Kergon, jens.axboe@oracle.com
Cc: dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org
Date: Thu, 25 Jun 2009 13:58:31 +1000
Message-ID: <19010.62951.886231.96622@notabene.brown>
Subject: REQUEST for new 'topology' metrics to be moved out of the 'queue' sysfs directory.

Hi,

I have (belatedly, I admit) been looking at the new 'topology' metrics
that were introduced for 2.6.31.  I have a few questions about them,
which I have been discussing with Martin, but there is one issue that I
feel fairly strongly about and would like to see changed before it gets
into a -final release.  Hence this email, to try to garner understanding
and consensus.

The various topology metrics are exposed to user-space through sysfs
attributes in the 'queue' subdirectory of the block device
(e.g. .../sda/queue).  I think this is a poor choice and would like to
find a better one.  To explain why, it probably helps to review the
current situation.
Before 2.6.30, 'queue' contained:

   hw_sector_size   max_hw_sectors_kb   nr_requests     rq_affinity
   iosched/         max_sectors_kb      read_ahead_kb   scheduler
   iostats          nomerges            rotational

Of these,

   max_hw_sectors_kb, nr_requests, rq_affinity, iosched/,
   max_sectors_kb, scheduler, nomerges, rotational

are really only relevant to the elevator code and the devices that use
it (ide, scsi, etc).  They are not relevant for dm or md (md has its
own separate 'md' directory, and before 2.6.30 the 'queue'
subdirectory did not even appear for dm or md devices).

Of the others:

   hw_sector_size - applicable to all block devices; could reasonably
                    be placed one level up in the device directory
                    (alongside 'size').
   read_ahead_kb  - a duplicate of bdi/read_ahead_kb.
   iostats        - a switch to enable or disable accounting of the
                    statistics reported in the 'stat' file (one level
                    up).

So most of 'queue' is specific to one class of devices (admittedly a
very large class), and the rest could be argued to be aberrations.
Adding a number of extra fields such as minimum_io_size and
optimal_io_size to 'queue' increases the number of aberrations, and
forces md and dm devices to have a 'queue' directory that is largely
irrelevant to them.

One approach we could take would be to hide those fields in 'queue'
that are not relevant to the current device, and let 'queue' be a
'dumping ground' for each block device to place whatever sysfs
attributes it wants (thus md would move all of md/* to queue/* and
leave 'md' as a symlink to 'queue').  I don't like this approach
because it does not make best use of the name space.  If 'md' and
'queue' are different directories, each is free to create new
attributes without risk of collision between different drivers - not
that a collision would be a technical problem, but it could be
confusing to users.

So, where to put these new fields?

They could go in the device directory, alongside 'size' and 'ro'.
Like those fields, the new ones give guidance to filesystems on how to
use the device.  Whether or not this is a good thing depends a bit on
how many fields we are talking about: one or two might be OK; four or
more might look messy.  There are currently four fields:
logical_block_size, physical_block_size, minimum_io_size,
optimal_io_size.  I have suggested to Martin that two are enough.
While I don't particularly want to debate that in this thread, it
could be relevant, so I'll outline my idea below.

They could go in 'bdi', along with read_ahead_kb.  read_ahead_kb gives
guidance on optimal read sizes; the topology fields give guidance on
optimal write sizes.  There is a synergy there, and these fields are
undeniably information about a backing device.  NFS has its own
per-filesystem bdi, so we would not want to impose fields on NFS that
weren't at all relevant; but NFS has 'rsize' and 'wsize', which are
somewhat related.  So I feel somewhat positive about this possibility.
My only concern is that 'read_ahead_kb' is more about reading *files*,
whereas the *_block_size and *_io_size fields are about writing to the
*device*.  I'm not sure how important a difference this is.

They could go in a new subdirectory of the block device, just like the
integrity fields - e.g. 'topology/' or 'metrics/'.  This would be my
preferred approach if there do turn out to be the full four fields.

Thanks for your attention.  Comments most welcome.

NeilBrown

----------------------
Alternate implementation with only two fields.

According to Documentation/ABI/testing/sysfs-block, both
physical_block_size and minimum_io_size are the smallest unit of IO
that doesn't require read-modify-write.  The first is thought to
relate to drives with 4K blocks, the second to RAID5 arrays.  But that
doesn't make sense as it stands: you cannot have two things that are
both the smallest.  Also, minimum_io_size and optimal_io_size are both
described as "preferred" sizes for IO - presumably writes, not reads.
Again, we cannot have two values that are both preferred.  There is
again some suggestion that one is for disks and the other for RAID,
but I cannot see how a mkfs would choose between them.

My conclusion is that there are two issues of importance:

 1/ avoiding read-modify-write, as that can affect correctness (when a
    write error happens, you can lose unrelated data);
 2/ throughput.

For each of these issues there are a number of sizes that are
relevant: as you increase the request size, performance can increase,
but there are key points where a small increase in size gives a big
increase in performance.  These sizes might include block size, chunk
size, stripe size, and cache size.  So I suggested two fields, each of
which can store multiple values:

   safe_write_size:      512 4096 327680
   preferred_write_size: 4096 65536 327680 10485760

The guidance for using these is simple:

   When choosing a size where atomicity of writes is important, choose
   the largest size from safe_write_size which is practical (or a
   multiple thereof).

   When choosing a size which doesn't require atomicity, but where
   throughput is important, choose a multiple of the largest size from
   preferred_write_size which is practical.

The smallest safe_write_size would be taken as the logical_block_size.
If we just have these two fields, I would put them in the top level
directory for the block device.
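To make that selection rule concrete, here is a minimal sketch (in
Python) of how a mkfs-style tool might pick a size.  Note the
multi-valued safe_write_size / preferred_write_size attributes are only
a proposal, so the field contents here are hypothetical, as is the
helper name:

```python
def choose_write_size(field, practical_limit):
    """Pick the largest advertised size not exceeding practical_limit.

    'field' is the space-separated contents of a (hypothetical)
    multi-valued attribute such as preferred_write_size.  Falls back
    to the smallest advertised size if even that exceeds the limit.
    A real tool would then round its IO size to a multiple of the
    returned value.
    """
    sizes = sorted(int(s) for s in field.split())
    fitting = [s for s in sizes if s <= practical_limit]
    return fitting[-1] if fitting else sizes[0]

# A mkfs prepared to issue up to 1 MiB per write would pick 327680
# from "preferred_write_size: 4096 65536 327680 10485760".
```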