2014-07-30 14:16:08

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 1/2] brd: Fix the partitions BUG


With current code after a call to:
bdev = blkdev_get_by_path(dev_name, mode, fs_type);
size = i_size_read(bdev->bd_inode);
part_size = bdev->bd_part->nr_sects << 9;

I get the following bad results:
dev_name == /dev/ram0
foo: [foo_mount:880] size=0x40000000 bdev=ffff88003dc24340 \
bd_inode=ffff88003dc24430 bd_part=ffff88003ca22848 part_size=0x40000000
dev_name == /dev/ram0p1
foo: [foo_mount:880] size=0x40000000 bdev=ffff88003d2f6d80 \
bd_inode=ffff88003d2f6e70 bd_part=ffff88003ca22848 part_size=0x40000000
dev_name == /dev/ram0p2
foo: [foo_mount:880] size=0x40000000 bdev=ffff88003dc24680 \
bd_inode=ffff88003dc24770 bd_part=ffff88003ca22848 part_size=0x40000000
Note how all three bdev(s) point to the same bd_part.

This is do to a single bad clubber in brd_probe() which is
removed in this patch:
- *part = 0;

because of this all 3 bdev(s) above get to point to the same bd_part[0]

While at it fix/rename brd_init_one() since all devices are created on
load of driver, brd_probe() will never be called with a new un-created
device.
brd_init_one() is now renamed to brd_find() which is what it does.

TODO: There is one more partitions BUG regarding
brd_direct_access() which is fixed on the next patch.

Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c7d138e..92334f6 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -523,22 +523,20 @@ static void brd_free(struct brd_device *brd)
kfree(brd);
}

-static struct brd_device *brd_init_one(int i)
+static struct brd_device *brd_find(int i)
{
struct brd_device *brd;

list_for_each_entry(brd, &brd_devices, brd_list) {
if (brd->brd_number == i)
- goto out;
+ return brd;
}

- brd = brd_alloc(i);
- if (brd) {
- add_disk(brd->brd_disk);
- list_add_tail(&brd->brd_list, &brd_devices);
- }
-out:
- return brd;
+ /* brd always allocates all its devices at load time, therefor
+ * brd_probe will never be called with a new brd_number
+ */
+ printk(KERN_EROR "brd: brd_find unexpected device %d\n", i);
+ return NULL;
}

static void brd_del_one(struct brd_device *brd)
@@ -554,11 +552,10 @@ static struct kobject *brd_probe(dev_t dev, int *part, void *data)
struct kobject *kobj;

mutex_lock(&brd_devices_mutex);
- brd = brd_init_one(MINOR(dev) >> part_shift);
+ brd = brd_find(MINOR(dev) >> part_shift);
kobj = brd ? get_disk(brd->brd_disk) : NULL;
mutex_unlock(&brd_devices_mutex);

- *part = 0;
return kobj;
}

--
1.9.3


2014-07-30 14:18:55

by Boaz Harrosh

[permalink] [raw]
Subject: [PATCH 2/2] brd: Fix brd_direct_access with partitions


When brd_direct_access() is called on a partition-bdev
it would access the wrong sector. And caller would then
corrupt the device's data.

This is a preliminary fix, Matthew Wilcox has a patch
in his DAX patchset which will define a global wrapper
to bdev->bd_disk->fops->direct_access that will do the
proper checks and translations before calling a driver
global member. (The way it is done at the rest of the
block stack)

CC: Matthew Wilcox <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
drivers/block/brd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 92334f6..7506864 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -378,9 +378,10 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,

if (!brd)
return -ENODEV;
+ sector += get_start_sect(bdev);
if (sector & (PAGE_SECTORS-1))
return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+ if (unlikely(sector + PAGE_SECTORS > part_nr_sects_read(bdev->bd_part)))
return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
--
1.9.3

2014-07-30 15:34:39

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 2/2] brd: Fix brd_direct_access with partitions

On Wed, Jul 30, 2014 at 05:18:47PM +0300, Boaz Harrosh wrote:
> When brd_direct_access() is called on a partition-bdev
> it would access the wrong sector. And caller would then
> corrupt the device's data.
>
> This is a preliminary fix, Matthew Wilcox has a patch
> in his DAX patchset which will define a global wrapper
> to bdev->bd_disk->fops->direct_access that will do the
> proper checks and translations before calling a driver
> global member. (The way it is done at the rest of the
> block stack)

Uh, no, let's just focus on getting the DAX code in instead of putting
in interim patches that will conflict. Patch 4/22 is uncontroversial,
fixes this problem, has no dependencies, and is key to the rest of the DAX
patchset. If this problem wants to be fixed, then put 4/22 in instead.

2014-07-30 15:37:51

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 2/2] brd: Fix brd_direct_access with partitions

On 07/30/2014 06:34 PM, Matthew Wilcox wrote:
> On Wed, Jul 30, 2014 at 05:18:47PM +0300, Boaz Harrosh wrote:
>> When brd_direct_access() is called on a partition-bdev
>> it would access the wrong sector. And caller would then
>> corrupt the device's data.
>>
>> This is a preliminary fix, Matthew Wilcox has a patch
>> in his DAX patchset which will define a global wrapper
>> to bdev->bd_disk->fops->direct_access that will do the
>> proper checks and translations before calling a driver
>> global member. (The way it is done at the rest of the
>> block stack)
>
> Uh, no, let's just focus on getting the DAX code in instead of putting
> in interim patches that will conflict. Patch 4/22 is uncontroversial,
> fixes this problem, has no dependencies, and is key to the rest of the DAX
> patchset. If this problem wants to be fixed, then put 4/22 in instead.
>

OK, I agree, could you please push 4/22 as reply to here for
Jens to take for 3.17 ? (together with my 1/2)

Thanks
Boaz

2014-07-30 16:50:20

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 1/2] brd: Fix the partitions BUG

On Wed, 2014-07-30 at 17:15 +0300, Boaz Harrosh wrote:
> With current code after a call to:
> bdev = blkdev_get_by_path(dev_name, mode, fs_type);
> size = i_size_read(bdev->bd_inode);
> part_size = bdev->bd_part->nr_sects << 9;
>
> I get the following bad results:
> dev_name == /dev/ram0
> foo: [foo_mount:880] size=0x40000000 bdev=ffff88003dc24340 \
> bd_inode=ffff88003dc24430 bd_part=ffff88003ca22848 part_size=0x40000000
> dev_name == /dev/ram0p1
> foo: [foo_mount:880] size=0x40000000 bdev=ffff88003d2f6d80 \
> bd_inode=ffff88003d2f6e70 bd_part=ffff88003ca22848 part_size=0x40000000
> dev_name == /dev/ram0p2
> foo: [foo_mount:880] size=0x40000000 bdev=ffff88003dc24680 \
> bd_inode=ffff88003dc24770 bd_part=ffff88003ca22848 part_size=0x40000000
> Note how all three bdev(s) point to the same bd_part.
>
> This is do to a single bad clubber in brd_probe() which is
> removed in this patch:
> - *part = 0;
>
> because of this all 3 bdev(s) above get to point to the same bd_part[0]
>
> While at it fix/rename brd_init_one() since all devices are created on
> load of driver, brd_probe() will never be called with a new un-created
> device.
> brd_init_one() is now renamed to brd_find() which is what it does.
>
> TODO: There is one more partitions BUG regarding
> brd_direct_access() which is fixed on the next patch.
>
> Signed-off-by: Boaz Harrosh <[email protected]>
> ---
> drivers/block/brd.c | 19 ++++++++-----------
> 1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index c7d138e..92334f6 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -523,22 +523,20 @@ static void brd_free(struct brd_device *brd)
> kfree(brd);
> }
>
> -static struct brd_device *brd_init_one(int i)
> +static struct brd_device *brd_find(int i)
> {
> struct brd_device *brd;
>
> list_for_each_entry(brd, &brd_devices, brd_list) {
> if (brd->brd_number == i)
> - goto out;
> + return brd;
> }
>
> - brd = brd_alloc(i);
> - if (brd) {
> - add_disk(brd->brd_disk);
> - list_add_tail(&brd->brd_list, &brd_devices);
> - }
> -out:
> - return brd;
> + /* brd always allocates all its devices at load time, therefor
> + * brd_probe will never be called with a new brd_number
> + */
> + printk(KERN_EROR "brd: brd_find unexpected device %d\n", i);

s/KERN_EROR/KERN_ERR/

> + return NULL;
> }
>
> static void brd_del_one(struct brd_device *brd)
> @@ -554,11 +552,10 @@ static struct kobject *brd_probe(dev_t dev, int *part, void *data)
> struct kobject *kobj;
>
> mutex_lock(&brd_devices_mutex);
> - brd = brd_init_one(MINOR(dev) >> part_shift);
> + brd = brd_find(MINOR(dev) >> part_shift);
> kobj = brd ? get_disk(brd->brd_disk) : NULL;
> mutex_unlock(&brd_devices_mutex);
>
> - *part = 0;
> return kobj;
> }

It is possible to create new block devices with BRD at runtime:

# mknod /dev/new_brd b 1 4
# fdisk -l /dev/new_brd

This causes a new BRD disk to be created, and hits your error case:

Jul 30 10:40:57 alara kernel: brd: brd_find unexpected device 4

I guess in general I'm not saying that BRD needs to have partitions - indeed
it may not give you much in the way of added functionality. As the code
currently stands partitions aren't surfaced anyway
(GENHD_FL_SUPPRESS_PARTITION_INFO is set). For PRD, however, I *do* want to
enable partitions correctly because eventually I'd like to enhance PRD so that
it *does* actually handle NVDIMMs correctly, and for that partitions do make
sense. And if I have to implement and debug partitions for PRD, it's easy to
stick them in BRD in case anyone wants to use them.

- Ross

2014-07-30 18:37:40

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 1/2] brd: Fix the partitions BUG

On 07/30/2014 07:50 PM, Ross Zwisler wrote:
<>
>> + */
>> + printk(KERN_EROR "brd: brd_find unexpected device %d\n", i);
>
> s/KERN_EROR/KERN_ERR/
>

Yes thanks, sigh, code should compile

driver error. I used pr_err but last inspection I saw that printk is used
everywhere and, crapped ...

>> + return NULL;
>> }
>>
>> static void brd_del_one(struct brd_device *brd)
>> @@ -554,11 +552,10 @@ static struct kobject *brd_probe(dev_t dev, int *part, void *data)
>> struct kobject *kobj;
>>
>> mutex_lock(&brd_devices_mutex);
>> - brd = brd_init_one(MINOR(dev) >> part_shift);
>> + brd = brd_find(MINOR(dev) >> part_shift);
>> kobj = brd ? get_disk(brd->brd_disk) : NULL;
>> mutex_unlock(&brd_devices_mutex);
>>
>> - *part = 0;
>> return kobj;
>> }
>
> It is possible to create new block devices with BRD at runtime:
>
> # mknod /dev/new_brd b 1 4
> # fdisk -l /dev/new_brd
>
> This causes a new BRD disk to be created, and hits your error case:
>
> Jul 30 10:40:57 alara kernel: brd: brd_find unexpected device 4
>

Ha OK, So this is the mystery key. I never knew that trick, OK
I guess I need to leave this in

> I guess in general I'm not saying that BRD needs to have partitions - indeed
> it may not give you much in the way of added functionality. As the code
> currently stands partitions aren't surfaced anyway
> (GENHD_FL_SUPPRESS_PARTITION_INFO is set). For PRD, however, I *do* want to
> enable partitions correctly because eventually I'd like to enhance PRD so that
> it *does* actually handle NVDIMMs correctly, and for that partitions do make
> sense.

So lets talk about that for a bit. Why would you want legacy partitions for
NVDIMMs. For one fdisk will waste 1M of memory on this. And with NVDIMMs you
actually need a different manager all together, more like lvm stuff.

I have patches here that changes prd a lot.
- Global memory parameters are gone each device remaps and operates on it's own memory range.
- the prd_params are gone instead you have one memmap= parameter of the form
memmap=nnn$ooo,[nnn$ooo]...
where:
nnn - is size in bytes of the form number[K/M/G]
ooo - is offset in bytes of the form number[K/M/G]

Very much the same as the memmap= Kernel parameters. So just copy/paste what
you did there and it will work.

Now each memory range has one prd_device created for it. Note that if
specified ranges are just contiguous then it is like your prd_count thing where
you took a contiguous range and sliced it up, only here they can be of
different size.
Off course it has an added fixture of dis-contiguous ranges like we have
in our multy-nodes NUMA system in the lab that host DDR3 NvDIMMs in two
banks.

- A sysfs interface is added to add/remove memory range on the fly like
osdblk

- If no parameters are specified at all, the Kernel command-line is parsed
and all memmap= sections are attempted and used if are not already claimed.

An interface like mknod above is not supported since it is actually pointless.

My current code still supports partitions but I still think it is silly.

[
This is all speculative for DDR3-NVDIMMs, we have seen that the memory controller
actually tags these DIMMs with type 12 but it looks like this is all vendor
specific right now and I understand that DDR4 standardize all that. So I was
hoping you guys are working on all that with the NvDIMM stuff.

Now lets say that I have established auto-detection for each DIMM and have
extracted its SN (Our DDR3 DIMMs each have an SN that can be displayed by the
vendor supplied tool)

An NvDIMM manager will need to establish an NvDIMM table and set order.
The motivation of a partition table is that a disk moved to another machine
and/or booted into another OS will have persistent and STD way to not
clobber over established DATA. But here we have like a disk cluster/raid,
an ordered set of drives.

Unique to NvDIMMs is the interleaving. If pairs are then inserted in wrong
order or are re-mixed. This should be detected and refused to be mounted
(unless a re-initialize is forced) So data is not silently lost on operator
errors.

All this is then persisted on some DIMM, first or last. If code is able to
re-arrange order, like when physical address have changed it will, but in the
wrong interleaving case it will refuse to mount.
]

> And if I have to implement and debug partitions for PRD, it's easy to
> stick them in BRD in case anyone wants to use them.
>

I was finally playing with BRD, and it is missing your getgeo patch, so
it is currently completely alien to fdisk and partitions.

So why, why keep a fixture that never existed, I still hope to convince you
that this is crap. Specially with brd that can have devices added on the fly
like you showed above. Who needs them at all? Why waist 1M of memory
on each device for no extra gain?

Specially in light of my new prd that does away of any needs of a partitioning
since it supports arbitrary slicing in another way.

> - Ross

I will send a new set of patches for brd tomorrow. If we want, really want,
NEW SUPPORT for partitions, we need 2 more patch.
But PLEASE consider just removing this dead never used crap. If you agree
I will send a cleaning patch ASAP.

Also I will send as RFC my prd patches if you want to see, it has some
nice surprises in them that I hope you will like ;-)

Cheers
Boaz