2014-11-05 14:00:15

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCHSET 0/5 v3] brd: partition fixes

Jens Hi

Currently brd has multiple bugs when trying to use partitions. After this
set all known problems are solved. Please see individual patch for description
of the problem.

[v3]
Same exact code but some commit messages changed to try and explain better
what was fixed and why. (Rebased on 3.18-rc3 but nothing changed in brd.c
since then)

[v2]
Based on Jens's linux-next [30e996a] incorporating the brd patch by Dmitry Monakhov.
Dmitry has introduced a new part_show parameter, this parameter is now removed
and we always "part_show=1".
Scripts that did part_show=1 will work just the same but will display a
message in logs. This is harmless. (And scripts can be modified to
remove this parameter)

[v1]
Current situation is that any attempt to use partitions with brd device would
create the partition but then any use will trash the data.

See: http://www.spinics.net/lists/linux-scsi/msg76737.html

So these patches fixes up all the problems we saw with the code, but not sacrificing
any of the old fixtures. See [patch 4/5] for more explanations.

list of patches:
[PATCH 1/5] axonram: Fix bug in direct_access
[PATCH 2/5] block: Change direct_access calling convention

These 2 above are from Matthew's DAX series latest version.
Exactly as is, taken from the 01org/prd.git tree
They are needed so to support direct_access with partitions.

[PATCH 3/5] brd: Fix all partitions BUGs

This one fixes all the very HARD bugs, which today
cause data corruption.

[PATCH 4/5] brd: Request from fdisk 4k alignment
[PATCH 5/5] brd: Add getgeo to block ops for fdisk

And the last two are to try and make fdisk work
properly with a direct_access() device.

One can fetch/view these patches from a public tree here:
git: git://git.open-osd.org/pmem.git brd-partitions branch
web: http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/brd-partitions

Thanks
Boaz


2014-11-05 14:01:14

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 1/5] axonram: Fix bug in direct_access

From: Matthew Wilcox <[email protected]>

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ad56edc..e8bb33b 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

return 0;
}
--
1.9.3

2014-11-05 14:02:47

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 2/5] block: Change direct_access calling convention

From: Matthew Wilcox <[email protected]>

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Boaz Harrosh <[email protected]>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 17 ++++-------------
drivers/block/brd.c | 14 +++++++-------
drivers/s390/block/dcssblk.c | 21 +++++++++-----------
fs/block_dev.c | 40 +++++++++++++++++++++++++++++++++++++++
fs/ext2/xip.c | 31 +++++++++++++-----------------
include/linux/blkdev.h | 6 ++++--
7 files changed, 86 insertions(+), 58 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b774729 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index e8bb33b..4afff8d 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset;
-
- offset = sector;
- if (device->bd_part != NULL)
- offset += device->bd_part->start_sect;
- offset <<= AXON_RAM_SECTOR_SHIFT;
- if (offset >= bank->size) {
- dev_err(&bank->device->dev, "Access outside of address space\n");
- return -ERANGE;
- }
+ loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return bank->size - offset;
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3598110..89e90ec 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,25 +370,25 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;

if (!brd)
return -ENODEV;
- if (sector & (PAGE_SECTORS-1))
- return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
- return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /*
+ * TODO: If size > PAGE_SIZE, we could look to see if the next page in
+ * the file happens to be mapped to the next page of physical RAM.
+ */
+ return PAGE_SIZE;
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -866,25 +866,22 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
- if (secnum % (PAGE_SIZE/512))
- return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
- return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ dev_sz = dev_info->end - dev_info->start;
+ offset = secnum * 512;
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return dev_sz - offset;
}

static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1d9c9f3..866b182 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -423,6 +423,46 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);

+/**
+ * bdev_direct_access() - Get the address for directly-accessibly memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory. The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar. The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ long avail;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+
+ if (size < 0)
+ return size;
+ if (!ops->direct_access)
+ return -EOPNOTSUPP;
+ if ((sector + DIV_ROUND_UP(size, 512)) >
+ part_nr_sects_read(bdev->bd_part))
+ return -ERANGE;
+ sector += get_start_sect(bdev);
+ if (sector % (PAGE_SIZE / 512))
+ return -EINVAL;
+ avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ if (!avail)
+ return -ERANGE;
+ return min(avail, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
/*
* pseudo-fs
*/
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return bdev_direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aac0f9e..ca8a44b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1603,8 +1603,8 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access)(struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1622,6 +1622,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+ unsigned long *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;
--
1.9.3

2014-11-05 14:04:59

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 3/5] brd: Fix all partitions BUGs

From: Boaz Harrosh <[email protected]>

This patch fixes up brd's partitions scheme, now enjoying all worlds.

The MAIN fix here is that currently, if one fdisks some partitions,
a BAD bug will make all partitions point to the same start-end sector
ie: 0 - brd_size And an mkfs of any partition would trash the partition
table and the other partition.

Another fix is that "mount -U uuid" will not work if show_part was not
specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
We now always load without it and remove the show_part parameter.

[We remove Dmitry's new module-param part_show it is now always
show]

So NOW the logic goes like this:
* max_part - Just says how many minors to reserve between ramX
devices. In any way, there can be as many partition as requested.
If minors between devices ends, then dynamic 259-major ids will
be allocated on the fly.
The default is now max_part=1, which means all partitions devt(s)
will be from the dynamic (259) major-range.
(If persistent partition minors is needed use max_part=X)
For example with /dev/sdX max_part is hard coded 16.

* Creation of new devices on the fly still/always work:
mknod /path/devnod b 1 X
fdisk -l /path/devnod
Will create a new device if [X / max_part] was not already
created before. (Just as before)

partitions on the dynamically created device will work as well
Same logic applies with minors as with the pre-created ones.

TODO: dynamic grow of device size. So each device can have it's
own size.

CC: Dmitry Monakhov <[email protected]>
Tested-by: Ross Zwisler <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 100 ++++++++++++++++++++--------------------------------
1 file changed, 38 insertions(+), 62 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 89e90ec..a7463c959 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -438,19 +438,18 @@ static const struct block_device_operations brd_fops = {
/*
* And now the modules code and kernel interface.
*/
-static int rd_nr;
-int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
-static int max_part;
-static int part_shift;
-static int part_show = 0;
+static int rd_nr = CONFIG_BLK_DEV_RAM_COUNT;
module_param(rd_nr, int, S_IRUGO);
MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
+
+int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
module_param(rd_size, int, S_IRUGO);
MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
+
+static int max_part = 1;
module_param(max_part, int, S_IRUGO);
-MODULE_PARM_DESC(max_part, "Maximum number of partitions per RAM disk");
-module_param(part_show, int, S_IRUGO);
-MODULE_PARM_DESC(part_show, "Control RAM disk visibility in /proc/partitions");
+MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
+
MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
MODULE_ALIAS("rd");
@@ -496,16 +495,15 @@ static struct brd_device *brd_alloc(int i)
brd->brd_queue->limits.discard_zeroes_data = 1;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);

- disk = brd->brd_disk = alloc_disk(1 << part_shift);
+ disk = brd->brd_disk = alloc_disk(max_part);
if (!disk)
goto out_free_queue;
disk->major = RAMDISK_MAJOR;
- disk->first_minor = i << part_shift;
+ disk->first_minor = i * max_part;
disk->fops = &brd_fops;
disk->private_data = brd;
disk->queue = brd->brd_queue;
- if (!part_show)
- disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+ disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);

@@ -527,10 +525,11 @@ static void brd_free(struct brd_device *brd)
kfree(brd);
}

-static struct brd_device *brd_init_one(int i)
+static struct brd_device *brd_init_one(int i, bool *new)
{
struct brd_device *brd;

+ *new = false;
list_for_each_entry(brd, &brd_devices, brd_list) {
if (brd->brd_number == i)
goto out;
@@ -541,6 +540,7 @@ static struct brd_device *brd_init_one(int i)
add_disk(brd->brd_disk);
list_add_tail(&brd->brd_list, &brd_devices);
}
+ *new = true;
out:
return brd;
}
@@ -556,70 +556,46 @@ static struct kobject *brd_probe(dev_t dev, int *part, void *data)
{
struct brd_device *brd;
struct kobject *kobj;
+ bool new;

mutex_lock(&brd_devices_mutex);
- brd = brd_init_one(MINOR(dev) >> part_shift);
+ brd = brd_init_one(MINOR(dev) / max_part, &new);
kobj = brd ? get_disk(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);

- *part = 0;
+ if (new)
+ *part = 0;
+
return kobj;
}

static int __init brd_init(void)
{
- int i, nr;
- unsigned long range;
struct brd_device *brd, *next;
+ int i;

/*
* brd module now has a feature to instantiate underlying device
* structure on-demand, provided that there is an access dev node.
- * However, this will not work well with user space tool that doesn't
- * know about such "feature". In order to not break any existing
- * tool, we do the following:
*
- * (1) if rd_nr is specified, create that many upfront, and this
- * also becomes a hard limit.
- * (2) if rd_nr is not specified, create CONFIG_BLK_DEV_RAM_COUNT
- * (default 16) rd device on module load, user can further
- * extend brd device by create dev node themselves and have
- * kernel automatically instantiate actual device on-demand.
+ * (1) if rd_nr is specified, create that many upfront. else
+ * it defaults to CONFIG_BLK_DEV_RAM_COUNT
+ * (2) User can further extend brd devices by create dev node themselves
+ * and have kernel automatically instantiate actual device
+ * on-demand. Example:
+ * mknod /path/devnod_name b 1 X # 1 is the rd major
+ * fdisk -l /path/devnod_name
+ * If (X / max_part) was not already created it will be created
+ * dynamically.
*/

- part_shift = 0;
- if (max_part > 0) {
- part_shift = fls(max_part);
-
- /*
- * Adjust max_part according to part_shift as it is exported
- * to user space so that user can decide correct minor number
- * if [s]he want to create more devices.
- *
- * Note that -1 is required because partition 0 is reserved
- * for the whole disk.
- */
- max_part = (1UL << part_shift) - 1;
- }
-
- if ((1UL << part_shift) > DISK_MAX_PARTS)
- return -EINVAL;
-
- if (rd_nr > 1UL << (MINORBITS - part_shift))
- return -EINVAL;
-
- if (rd_nr) {
- nr = rd_nr;
- range = rd_nr << part_shift;
- } else {
- nr = CONFIG_BLK_DEV_RAM_COUNT;
- range = 1UL << MINORBITS;
- }
-
if (register_blkdev(RAMDISK_MAJOR, "ramdisk"))
return -EIO;

- for (i = 0; i < nr; i++) {
+ if (unlikely(!max_part))
+ max_part = 1;
+
+ for (i = 0; i < rd_nr; i++) {
brd = brd_alloc(i);
if (!brd)
goto out_free;
@@ -631,10 +607,10 @@ static int __init brd_init(void)
list_for_each_entry(brd, &brd_devices, brd_list)
add_disk(brd->brd_disk);

- blk_register_region(MKDEV(RAMDISK_MAJOR, 0), range,
+ blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS,
THIS_MODULE, brd_probe, NULL, NULL);

- printk(KERN_INFO "brd: module loaded\n");
+ pr_info("brd: module loaded\n");
return 0;

out_free:
@@ -644,21 +620,21 @@ out_free:
}
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");

+ pr_info("brd: module NOT loaded !!!\n");
return -ENOMEM;
}

static void __exit brd_exit(void)
{
- unsigned long range;
struct brd_device *brd, *next;

- range = rd_nr ? rd_nr << part_shift : 1UL << MINORBITS;
-
list_for_each_entry_safe(brd, next, &brd_devices, brd_list)
brd_del_one(brd);

- blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), range);
+ blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS);
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+ pr_info("brd: module unloaded\n");
}

module_init(brd_init);
--
1.9.3

2014-11-05 14:08:24

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 4/5] brd: Request from fdisk 4k alignment

From: Boaz Harrosh <[email protected]>

Because of the direct_access() API which returns a PFN. partitions
better start on 4K boundary, else offset ZERO of a partition will
not be aligned and blk_direct_access() will fail the call.

By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate
this to fdisk and friends.

Note that blk_queue_physical_block_size() also trashes io_min, but
we can leave this one to be 512. io_min as opposed to physical_block_size
will actually change Kernel behavior, but physical_block_size is not
used by Kernel except being exported to user-mode.

before this patch running fdisk on a default size brd of 4M
the first sector offered is 34 (BAD), but after this patch it
will be 40, ie 8 sectors aligned. Also when entering some random
partition sizes the next partition-start sector is offered 8 sectors
aligned after this patch. (Please note that with fdisk the user
can still enter bad values, only the offered default values will
be correct)

Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a7463c959..0026998 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -486,10 +486,17 @@ static struct brd_device *brd_alloc(int i)
brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
if (!brd->brd_queue)
goto out_free_dev;
+
blk_queue_make_request(brd->brd_queue, brd_make_request);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

+ /* This is so fdisk will align partitions on 4k, because of
+ * direct_access API needing 4k alignment, returning a PFN
+ */
+ blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);
+ brd->brd_queue->limits.io_min = 512; /* Don't use the accessor */
+
brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
brd->brd_queue->limits.discard_zeroes_data = 1;
--
1.9.3

2014-11-05 14:10:57

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

From: Boaz Harrosh <[email protected]>

Now when fdisk is run on brd it will ask some cryptic
questions about CHS. This is because the getgeo block operation
is not implemented.

With the usual emulated values of 64, 32, X and the default
size of 4M device, fdisk will offer 34 as possible first-sector,
And after the previous patch of 4K sectores it will offer
40 as possible first-sector.

But with the values presented here of 1, 1, 1 fdisk will offer
8 (4K) as possible first sector.

I have done a quick audit of the fdisk code. The CHS calculation
is very convoluted but at the end it comes out with a number.
Which is taken into consideration in first-sector to allow. With
all 1(s) this numbers is very small and other numbers come into
account. Also note that if the device is big like 1G (not sure what
is the threshold) fdisk will offer 1M (2048) as possible first-
sector, and it does not matter what numbers we give here.

Only fdisk is problematic both libgparted (and all applications
that use it) and cfdisk always offer 1M alignment and do not call
getgeo at all.

Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0026998..e0bc072 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -19,6 +19,7 @@
#include <linux/radix-tree.h>
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/hdreg.h>

#include <asm/uaccess.h>

@@ -426,6 +427,23 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
return error;
}

+static int brd_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+ /* Just tell fdisk to get out of the way. The math here is so
+ * convoluted and does not make any sense at all. With all 1s
+ * The math just gets out of the way.
+ * NOTE: I was trying to get some values that will make fdisk
+ * Want to align first sector on 4K (like 8, 16, 20, ... sectors) but
+ * nothing worked, I searched the net the math is not your regular
+ * simple multiplication at all. If you managed to get these please
+ * fix here. For now we use 4k physical sectors for this
+ */
+ geo->heads = 1;
+ geo->sectors = 1;
+ geo->cylinders = 1;
+ return 0;
+}
+
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
@@ -433,6 +451,7 @@ static const struct block_device_operations brd_fops = {
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
#endif
+ .getgeo = brd_getgeo,
};

/*
--
1.9.3

2014-11-05 14:20:30

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:

Hey Boaz,

Boaz> Note that blk_queue_physical_block_size() also trashes io_min, but
Boaz> we can leave this one to be 512. io_min as opposed to
Boaz> physical_block_size will actually change Kernel behavior

Care to elaborate?

Also, fdisk and parted should be aligning on 1MB by default regardless
of block size. So I'm also not sure why you need to trick fdisk to align
by setting the pbs.

--
Martin K. Petersen Oracle Linux Engineering

2014-11-05 14:48:59

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

On 11/05/2014 04:20 PM, Martin K. Petersen wrote:
>>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
>
> Hey Boaz,
>
> Boaz> Note that blk_queue_physical_block_size() also trashes io_min, but
> Boaz> we can leave this one to be 512. io_min as opposed to
> Boaz> physical_block_size will actually change Kernel behavior
>
> Care to elaborate?
>
> Boaz> before this patch running fdisk on a default size brd of 4M
> Boaz> the first sector offered is 34 (BAD), but after this patch it
> Boaz> will be 40, ie 8 sectors aligned. Also when entering some random
> Boaz> partition sizes the next partition-start sector is offered 8 sectors
> Boaz> aligned after this patch. (Please note that with fdisk the user
> Boaz> can still enter bad values, only the offered default values will
> Boaz> be correct)

I did elaborate, Please try with these numbers above and see for
yourself. (Summery brd-size=4M before this patch will offer 34 as
first-sector, after 40)

> Also, fdisk and parted should be aligning on 1MB by default regardless
> of block size. So I'm also not sure why you need to trick fdisk to align
> by setting the pbs.
>

fdisk will not align partition-start on 1M, only the very first-sector,
libgparted and cfdisk are as you say, but with fdisk if you put any odd
numbered partition-size, the next start-sector will just be the sum, but
with this patch it will offer a small hole and will align the
partition-start on 4K.
(And even with this patch user can enter bad values)

Please believe me that I did not try to invent work for myself. I wanted
to fix the partition bugs, with the use of direct_access and to my dismay
I found that it is very easy to have broken partition alignments. With this
patch it is still possible but mach harder to do so.

Thanks
Boaz

2014-11-05 15:14:23

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 5/5 v4] brd: Add getgeo to block ops for fdisk

From: Boaz Harrosh <[email protected]>

Now when fdisk is run on brd it will ask some cryptic
questions about CHS. This is because the getgeo block operation
is not implemented.

With the usual emulated values of 64, 32, X and the default
size of 4M device, fdisk will offer 34 as possible first-sector,
And after the previous patch of 4K sectores it will offer
40 as possible first-sector.

But with the values presented here of 1, 1, 1 fdisk will offer
8 (4K) as possible first sector.

I have done a quick audit of the fdisk code. The CHS calculation
is very convoluted but at the end it comes out with a number.
Which is taken into consideration in first-sector to allow. With
all 1(s) this numbers is very small and other numbers come into
account. Also note that if the device is big like 1G (not sure what
is the threshold) fdisk will offer 1M (2048) as possible first-
sector, and it does not matter what numbers we give here.

Only fdisk is problematic both libgparted (and all applications
that use it) and cfdisk always offer 1M alignment and do not call
getgeo at all.

Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0026998..7965bf9 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -19,6 +19,7 @@
#include <linux/radix-tree.h>
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/hdreg.h>

#include <asm/uaccess.h>

@@ -426,6 +427,22 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
return error;
}

+static int brd_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+ /* With the usual emulated values of 64, 32, X and the
+ * default size of 4M device, fdisk will offer 40 as possible
+ * first-sector.
+ * But with the values presented here of 1, 1, 1 fdisk will
+ * offer 8 (4K) as possible first sector.
+ * Otherwise, either way, it effects nothing. Only fdisk calls
+ * this vector at all.
+ */
+ geo->heads = 1;
+ geo->sectors = 1;
+ geo->cylinders = 1;
+ return 0;
+}
+
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
@@ -433,6 +450,7 @@ static const struct block_device_operations brd_fops = {
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
#endif
+ .getgeo = brd_getgeo,
};

/*
--
1.9.3

2014-11-05 15:18:21

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 5/5 v4] brd: Add getgeo to block ops for fdisk

On 11/05/2014 05:14 PM, Boaz Harrosh wrote:
> From: Boaz Harrosh <[email protected]>
>
Sorry

[V4]
Also fix the cryptic comment at the source code.

I have pushed new trees to the public tree as well

Thanks
Boaz

<>
> +static int brd_getgeo(struct block_device *bd, struct hd_geometry *geo)
> +{
> + /* With the usual emulated values of 64, 32, X and the
> + * default size of 4M device, fdisk will offer 40 as possible
> + * first-sector.
> + * But with the values presented here of 1, 1, 1 fdisk will
> + * offer 8 (4K) as possible first sector.
> + * Otherwise, either way, it effects nothing. Only fdisk calls
> + * this vector at all.
> + */
> + geo->heads = 1;
> + geo->sectors = 1;
> + geo->cylinders = 1;
> + return 0;
> +}
> +
> static const struct block_device_operations brd_fops = {
> .owner = THIS_MODULE,
> .rw_page = brd_rw_page,
> @@ -433,6 +450,7 @@ static const struct block_device_operations brd_fops = {
> #ifdef CONFIG_BLK_DEV_XIP
> .direct_access = brd_direct_access,
> #endif
> + .getgeo = brd_getgeo,
> };
>
> /*
>

2014-11-06 17:26:19

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:

Boaz,

Boaz> Note that blk_queue_physical_block_size() also trashes io_min, but
Boaz> we can leave this one to be 512. io_min as opposed to
Boaz> physical_block_size will actually change Kernel behavior

>> Care to elaborate?

I wanted you to elaborate on how io_min changes kernel behavior.

>> Also, fdisk and parted should be aligning on 1MB by default
>> regardless of block size. So I'm also not sure why you need to trick
>> fdisk to align by setting the pbs.

Boaz> fdisk will not align partition-start on 1M, only the very
Boaz> first-sector, libgparted and cfdisk are as you say, but with fdisk
Boaz> if you put any odd numbered partition-size, the next start-sector
Boaz> will just be the sum, but with this patch it will offer a small
Boaz> hole and will align the partition-start on 4K. (And even with
Boaz> this patch user can enter bad values)

It really sounds like either your fdisk is way too old or you are
running it in DOS compat mode.

I don't have a fundamental issue reporting pbs of 4K. But if you are
only doing it to force a certain partition alignment then it sounds like
a kernel fix for a userland problem.

I've CC:ed Karel who can comment on fdisk partition alignment issues.

--
Martin K. Petersen Oracle Linux Engineering

2014-11-07 09:10:43

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

On Thu, Nov 06, 2014 at 12:25:56PM -0500, Martin K. Petersen wrote:
> >>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
>
> Boaz,
>
> Boaz> Note that blk_queue_physical_block_size() also trashes io_min, but
> Boaz> we can leave this one to be 512. io_min as opposed to
> Boaz> physical_block_size will actually change Kernel behavior
>
> >> Care to elaborate?
>
> I wanted you to elaborate on how io_min changes kernel behavior.
>
> >> Also, fdisk and parted should be aligning on 1MB by default
> >> regardless of block size. So I'm also not sure why you need to trick
> >> fdisk to align by setting the pbs.
>
> Boaz> fdisk will not align partition-start on 1M, only the very
> Boaz> first-sector, libgparted and cfdisk are as you say, but with fdisk
> Boaz> if you put any odd numbered partition-size, the next start-sector
> Boaz> will just be the sum, but with this patch it will offer a small
> Boaz> hole and will align the partition-start on 4K. (And even with
> Boaz> this patch user can enter bad values)
>
> It really sounds like either your fdisk is way too old or you are
> running it in DOS compat mode.
>
> I don't have a fundamental issue reporting pbs of 4K. But if you are
> only doing it to force a certain partition alignment then it sounds like
> a kernel fix for a userland problem.
>
> I've CC:ed Karel who can comment on fdisk partition alignment issues.

fdisk by default aligns partitions and uses 1MiB offset (for first
primary and all logical partitions). It's low-level tool for experts,
so if you ask for non-aligned offsets it will follow your wishes of
course. The default values in dialogs are always aligned. All
misaligned partitions are always reported (by warning messages).

Boaz, it would be nice to have any example (copy & past fdisk output).

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2014-11-07 10:00:16

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

On Wed, Nov 05, 2014 at 04:10:50PM +0200, Boaz Harrosh wrote:
> From: Boaz Harrosh <[email protected]>
>
> Now when fdisk is run on brd it will ask some cryptic
> questions about CHS. This is because the getgeo block operation
> is not implemented.

Again, what fdisk (util-linux) version?

> I have done a quick audit of the fdisk code. The CHS calculation
> is very convoluted but at the end it comes out with a number.
> Which is taken into consideration in first-sector to allow. With
> all 1(s) this numbers is very small and other numbers come into
> account. Also note that if the device is big like 1G (not sure what
> is the threshold) fdisk will offer 1M (2048) as possible first-
> sector, and it does not matter what numbers we give here.

oh.. sounds like archeology, CHS calculation does not mater in the
current code, it follows I/O limits (including crazy alignment
offset).

Karel


--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2014-11-09 16:57:32

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

On 11/07/2014 11:23 AM, Karel Zak wrote:
> On Wed, Nov 05, 2014 at 04:10:50PM +0200, Boaz Harrosh wrote:
>> From: Boaz Harrosh <[email protected]>
>>
>> Now when fdisk is run on brd it will ask some cryptic
>> questions about CHS. This is because the getgeo block operation
>> is not implemented.
>
> Again, what fdisk (util-linux) version?
>

Hi Karel, Dave and Jens

With Fedora 20 > "fdisk from util-linux 2.24.2"

>> I have done a quick audit of the fdisk code. The CHS calculation
>> is very convoluted but at the end it comes out with a number.
>> Which is taken into consideration in first-sector to allow. With
>> all 1(s) this numbers is very small and other numbers come into
>> account. Also note that if the device is big like 1G (not sure what
>> is the threshold) fdisk will offer 1M (2048) as possible first-
>> sector, and it does not matter what numbers we give here.
>
> oh.. sounds like archeology, CHS calculation does not mater in the
> current code, it follows I/O limits (including crazy alignment
> offset).
>

with a small 4M disk

I see brd_getgeo() getting called on fdisk load and when pressing
g or o. But it no longer has any effect at all if I have it defined
returning CHS(4,64,32) or returning CHS(1,1,1) or not defined at all
I get the same exact below experience:
(Dropping the none-relevant prompts)

======== *WITHOUT* 4k physical_block_size PATCH ===============================================================
Command (m for help): g
Command (m for help): n
First sector (34-8158, default 34):
Last sector, +sectors or +size{K,M,G,T,P} (34-8158, default 8158): 1717
Command (m for help): n
First sector (1718-8158, default 1718):
# NOTE first sector is last "prev-end" + 1
Last sector, +sectors or +size{K,M,G,T,P} (1718-8158, default 8158):
Command (m for help): p
Disk /dev/ram0: 4 MiB, 4194304 bytes, 8192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5C14F69F-EA44-4FE3-984A-9948D6AB7628

Device Start End Size Type
/dev/ram0p1 34 1717 842K Linux filesystem
/dev/ram0p2 1718 8158 3.1M Linux filesystem

---

Command (m for help): o
Command (m for help): n
First sector (1-8191, default 1):
Last sector, +sectors or +size{K,M,G,T,P} (1-8191, default 8191): 1717
Command (m for help): n
First sector (1718-8191, default 1718):
# NOTE same here
Last sector, +sectors or +size{K,M,G,T,P} (1718-8191, default 8191):
Command (m for help): p
Disk /dev/ram0: 4 MiB, 4194304 bytes, 8192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xe5c9e1b1

Device Boot Start End Blocks Id System
/dev/ram0p1 1 1717 858+ 83 Linux
/dev/ram0p2 1718 8191 3237 83 Linux

======== 4k physical_block_size PATCH ===============================================================

Command (m for help): g
Command (m for help): n
First sector (34-8158, default 40):
Last sector, +sectors or +size{K,M,G,T,P} (40-8158, default 8158): 1717
Command (m for help): n
First sector (34-8158, default 1720):
# NOTE 34-8158 again, only with gpt
Last sector, +sectors or +size{K,M,G,T,P} (1720-8158, default 8158):
Command (m for help): p
Disk /dev/ram0: 4 MiB, 4194304 bytes, 8192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 892E3C8A-3942-4C06-9B5B-6BA6B5A84AB9

Device Start End Size Type
/dev/ram0p1 40 1717 839K Linux filesystem
/dev/ram0p2 1720 8158 3.1M Linux filesystem

---

Command (m for help): o
Command (m for help): n
First sector (8-8191, default 8):
Last sector, +sectors or +size{K,M,G,T,P} (8-8191, default 8191): 1717
Command (m for help): n
First sector (1718-8191, default 1720):
# NOTE with dos its good
Last sector, +sectors or +size{K,M,G,T,P} (1720-8191, default 8191):
Command (m for help): p
Disk /dev/ram0: 4 MiB, 4194304 bytes, 8192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xaa069afe

Device Boot Start End Blocks Id System
/dev/ram0p1 8 1717 855 83 Linux
/dev/ram0p2 1720 8191 3236 83 Linux

=======

So I'm dropping the getgeo() patch all together. It is now fixed with new
util-linux 2.24.2 and is not needed at all. (Before fdisk gave me a prompt
on load without it)

Jens please drop this patch titled:
[PATCH 5/5] brd: Add getgeo to block ops for fdisk

Dave, Karel, what would you say fdisk should do? do you think it behaves correctly
to only align with the 4k-physical_block_size or must it always align ?

So it looks like we still need the 4k-physical_block_size PATCH for current
fdisk, I don't mind if it is decided to be fixed in user-mode then we can
drop this patch as well.

[ For Karel 2 things:
- It looks like getgeo has no effect (Is it used at all?), you might want to
remove the call. (libgparted does not use it)
- Checkout the small bug above with gpt and first-sector of second partition
]

Many Thanks
Boaz

2014-11-09 17:52:25

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

On 11/06/2014 07:25 PM, Martin K. Petersen wrote:
>>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
>
> Boaz,
>
> Boaz> Note that blk_queue_physical_block_size() also trashes io_min, but
> Boaz> we can leave this one to be 512. io_min as opposed to
> Boaz> physical_block_size will actually change Kernel behavior
>
>>> Care to elaborate?
>
> I wanted you to elaborate on how io_min changes kernel behavior.
>

OK You are correct. io_min changes behavior of Kernel in exactly the same
way as physical_block_size.

Through the call to queue_limit_alignment_offset() inside add_partition()
by setting hd_struct->alignment_offset

I do not know why I thought that only io_min does that, I can see now that
both effect the Kernel the same way. Which scares me a bit.

Will I have problems?

<>
>
> It really sounds like either your fdisk is way too old or you are
> running it in DOS compat mode.
>
> I don't have a fundamental issue reporting pbs of 4K. But if you are
> only doing it to force a certain partition alignment then it sounds like
> a kernel fix for a userland problem.
>
> I've CC:ed Karel who can comment on fdisk partition alignment issues.
>

Thanks Martin. I agree, we should not fix user-mode problems. Please see
the other email with the exact numbers I get from fdisk. It looks like
when everything is 512 like default it would not align my partitions, but
with the 4k-phisical thing it would. What is the expected behavior we want?

Thanks
Boaz

2014-11-10 09:59:05

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

On Sun, Nov 09, 2014 at 06:57:27PM +0200, Boaz Harrosh wrote:
> with a small 4M disk

This is the problem, for small disks (<= 4MiB) we don't use 1MiB
grain because it does not make sense. For so small devices the grain
(and the first partition offset) is the same as physical sector size.

Try something bigger in your tests :-)

> I see brd_getgeo() getting called on fdisk load and when pressing
> g or o. But it no longer has any effect at all if I have it defined
> returning CHS(4,64,32) or returning CHS(1,1,1) or not defined at all
> I get the same exact below experience:

The alignment and topology code is generic, it gathers all
information about the device, but it's fine if the device does not
provide HDIO_GETGEO. The geometry is currently used for DOS
compatible mode, or SGI and SUN only.

> ======== 4k physical_block_size PATCH ===============================================================
>
> Command (m for help): g
> Command (m for help): n
> First sector (34-8158, default 40):
> Last sector, +sectors or +size{K,M,G,T,P} (40-8158, default 8158): 1717
> Command (m for help): n
> First sector (34-8158, default 1720):
> # NOTE 34-8158 again, only with gpt

It seems like a fdisk bug, the gap between 34-40 is smaller than phy sector
size.. I'll fix it, thanks!

> Dave, Karel, what would you say fdisk should do? do you think it behaves correctly
> to only align with the 4k-physical_block_size or must it always align ?

If you expect partitions aligned to 4K (~pagesize?) then you have to
provide proper information to userspace (sector size or min/opt_io),
so from my point of view the patch makes sense.

Note that for example zram uses 4K logical and physical sector sizes
as well as all I/O limits are aligned to 4K at all:

# modprobe zram
# zramctl --find --size 4MiB
/dev/zram0

# fdisk -l /dev/zram0
Disk /dev/zram0: 4 MiB, 4194304 bytes, 1024 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Is there a reason for brd to behave differently?

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2014-11-10 11:15:45

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

On 11/10/2014 11:58 AM, Karel Zak wrote:
> On Sun, Nov 09, 2014 at 06:57:27PM +0200, Boaz Harrosh wrote:
>> with a small 4M disk
>
> This is the problem, for small disks (<= 4MiB) we don't use 1MiB
> grain because it does not make sense. For so small devices the grain
> (and the first partition offset) is the same as physical sector size.
>
> Try something bigger in your tests :-)
>

OK thanks, I saw that. Only I was not sure what is the threshold.
just that 4MiB is the default for brd at Kconfig so I naturally hit it.

>> I see brd_getgeo() getting called on fdisk load and when pressing
>> g or o. But it no longer has any effect at all if I have it defined
>> returning CHS(4,64,32) or returning CHS(1,1,1) or not defined at all
>> I get the same exact below experience:
>
> The alignment and topology code is generic, it gathers all
> information about the device, but it's fine if the device does not
> provide HDIO_GETGEO. The geometry is currently used for DOS
> compatible mode, or SGI and SUN only.
>

OK cool. So we should just remove it. Thanks for confirming.
(When I started all this, fdisk used to prompt for these values which started
the all land slide. Cool one problem fixed, me happy)

>> ======== 4k physical_block_size PATCH ===============================================================
>>
>> Command (m for help): g
>> Command (m for help): n
>> First sector (34-8158, default 40):
>> Last sector, +sectors or +size{K,M,G,T,P} (40-8158, default 8158): 1717
>> Command (m for help): n
>> First sector (34-8158, default 1720):
>> # NOTE 34-8158 again, only with gpt
>
> It seems like a fdisk bug, the gap between 34-40 is smaller than phy sector
> size.. I'll fix it, thanks!
>

Please note that the final output needs to be 1718-8158 not
40-8158. I was not sure if you noticed from your explanation above.

>> Dave, Karel, what would you say fdisk should do? do you think it behaves correctly
>> to only align with the 4k-physical_block_size or must it always align ?
>
> If you expect partitions aligned to 4K (~pagesize?) then you have to
> provide proper information to userspace (sector size or min/opt_io),
> so from my point of view the patch makes sense.
>
> Note that for example zram uses 4K logical and physical sector sizes
> as well as all I/O limits are aligned to 4K at all:
>

In theory this is a great idea and makes perfect sense. Even the
page addressing instead of the sector addressing. But what I'm afraid
will happen at Kernel is that it will attempt to do read-modify-write
on small writes, but with memory technology this makes no sense.
Actually memory is *byte* addressable.

It is good for zram because I'd guess it compresses each page so the
read-modify-write done by Kernel is good for zram, less coding.

> # modprobe zram
> # zramctl --find --size 4MiB
> /dev/zram0
>
> # fdisk -l /dev/zram0
> Disk /dev/zram0: 4 MiB, 4194304 bytes, 1024 sectors
> Units: sectors of 1 * 4096 = 4096 bytes
> Sector size (logical/physical): 4096 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>
>
> Is there a reason for brd to behave differently?
>
> Karel
>

Thanks Karel
Boaz

2014-11-10 13:26:50

by Karel Zak

[permalink] [raw]
Subject: Re: [PATCH 5/5] brd: Add getgeo to block ops for fdisk

On Mon, Nov 10, 2014 at 01:15:34PM +0200, Boaz Harrosh wrote:
> >> Command (m for help): n
> >> First sector (34-8158, default 40):
> >> Last sector, +sectors or +size{K,M,G,T,P} (40-8158, default 8158): 1717
> >> Command (m for help): n
> >> First sector (34-8158, default 1720):
> >> # NOTE 34-8158 again, only with gpt
> >
> > It seems like a fdisk bug, the gap between 34-40 is smaller than phy sector
> > size.. I'll fix it, thanks!
> >
>
> Please note that the final output needs to be 1718-8158 not
> 40-8158. I was not sure if you noticed from your explanation above.

Yes, I understand, fixed in util-linux master branch.

https://github.com/karelzak/util-linux/commit/4a4616b22e21f3c05030dc05d98016fcb15a2ee9

Karel


--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2014-11-10 17:00:36

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 4/5] brd: Request from fdisk 4k alignment

>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:

Boaz> I do not know why I thought that only io_min does that, I can see
Boaz> now that both effect the Kernel the same way. Which scares me a
Boaz> bit.

The difference is subtle. io_min is a hint from the storage about its
preferred minimum I/O size. pbs describes the smallest unit that can be
atomically written (like a sector on a drive with 512-byte logical/4K
physical blocks). In most cases they are the same, pbs is just a
slightly harder guarantee than io_min.

What I was objecting to in your patch description was mostly the
statement you made that these values affect kernel behavior. They really
don't. Not directly, anyway. The queue limits are stacked and offsets
are adjusted based on partitions, etc. But they don't alter the kernel
runtime behavior.

The queue limits are reported to userland and will affect things like
partitioning, MD/DM tooling and mkfs.*. And therefore they only
indirectly affect the kernel's behavior.

--
Martin K. Petersen Oracle Linux Engineering