2007-02-21 21:23:57

by Mike Miller (OS Dev)

Subject: [Patch 1/2] cciss: fix for 2TB support

Patch 1/2

This patch changes the way we determine whether a logical volume is larger than 2TB. The
original test looked for a total_size of 0: we added 1 to total_size, so our read_capacity
would return a size of 0 for >2TB logical volumes. We assumed we could never have an
lv of size 0, so it seemed OK until we ran in a clustered system. There the backup node
would see a size of 0 due to the reservation on the drive. That caused the driver to
switch to 16-byte CDBs, which are not supported on older controllers. After that
everything was broken.
It may seem petty, but I don't see the value in trying to determine per request whether
the LBA is beyond the 2TB boundary. That's why, once we switch, we use 16-byte CDBs for
all read/write operations.
Please consider this for inclusion.

Signed-off-by: Mike Miller <[email protected]>
------------------------------------------------------------------------------------------
diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 05dfe35..916aab0 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1291,13 +1291,19 @@ static void cciss_update_drive_info(int
if (inq_buff == NULL)
goto mem_msg;

+ /* testing to see if 16-byte CDBs are already being used */
+ if (h->cciss_read == CCISS_READ_16) {
+ cciss_read_capacity_16(h->ctlr, drv_index, 1,
+ &total_size, &block_size);
+ goto geo_inq;
+ }
+
cciss_read_capacity(ctlr, drv_index, 1,
&total_size, &block_size);

- /* total size = last LBA + 1 */
- /* FFFFFFFF + 1 = 0, cannot have a logical volume of size 0 */
- /* so we assume this volume this must be >2TB in size */
- if (total_size == (__u32) 0) {
+ /* if read_capacity returns all F's this volume is >2TB in size */
+ /* so we switch to 16-byte CDB's for all read/write ops */
+ if (total_size == 0xFFFFFFFF) {
cciss_read_capacity_16(ctlr, drv_index, 1,
&total_size, &block_size);
h->cciss_read = CCISS_READ_16;
@@ -1306,6 +1312,7 @@ static void cciss_update_drive_info(int
h->cciss_read = CCISS_READ_10;
h->cciss_write = CCISS_WRITE_10;
}
+geo_inq:
cciss_geometry_inquiry(ctlr, drv_index, 1, total_size, block_size,
inq_buff, &h->drv[drv_index]);

@@ -1917,13 +1924,14 @@ static void cciss_geometry_inquiry(int c
drv->raid_level = inq_buff->data_byte[8];
}
drv->block_size = block_size;
- drv->nr_blocks = total_size;
+ drv->nr_blocks = total_size + 1;
t = drv->heads * drv->sectors;
if (t > 1) {
- unsigned rem = sector_div(total_size, t);
+ sector_t real_size = total_size + 1;
+ unsigned long rem = sector_div(real_size, t);
if (rem)
- total_size++;
- drv->cylinders = total_size;
+ real_size++;
+ drv->cylinders = real_size;
}
} else { /* Get geometry failed */
printk(KERN_WARNING "cciss: reading geometry failed\n");
@@ -1953,16 +1961,16 @@ cciss_read_capacity(int ctlr, int logvol
ctlr, buf, sizeof(ReadCapdata_struct),
1, logvol, 0, NULL, TYPE_CMD);
if (return_code == IO_OK) {
- *total_size = be32_to_cpu(*(__u32 *) buf->total_size)+1;
+ *total_size = be32_to_cpu(*(__u32 *) buf->total_size);
*block_size = be32_to_cpu(*(__u32 *) buf->block_size);
} else { /* read capacity command failed */
printk(KERN_WARNING "cciss: read capacity failed\n");
*total_size = 0;
*block_size = BLOCK_SIZE;
}
- if (*total_size != (__u32) 0)
+ if (*total_size != 0)
printk(KERN_INFO " blocks= %llu block_size= %d\n",
- (unsigned long long)*total_size, *block_size);
+ (unsigned long long)*total_size+1, *block_size);
kfree(buf);
return;
}
@@ -1989,7 +1997,7 @@ cciss_read_capacity_16(int ctlr, int log
1, logvol, 0, NULL, TYPE_CMD);
}
if (return_code == IO_OK) {
- *total_size = be64_to_cpu(*(__u64 *) buf->total_size)+1;
+ *total_size = be64_to_cpu(*(__u64 *) buf->total_size);
*block_size = be32_to_cpu(*(__u32 *) buf->block_size);
} else { /* read capacity command failed */
printk(KERN_WARNING "cciss: read capacity failed\n");
@@ -1997,7 +2005,7 @@ cciss_read_capacity_16(int ctlr, int log
*block_size = BLOCK_SIZE;
}
printk(KERN_INFO " blocks= %llu block_size= %d\n",
- (unsigned long long)*total_size, *block_size);
+ (unsigned long long)*total_size+1, *block_size);
kfree(buf);
return;
}
@@ -3119,8 +3127,9 @@ #endif /* CCISS_DEBUG */
}
cciss_read_capacity(cntl_num, i, 0, &total_size, &block_size);

- /* total_size = last LBA + 1 */
- if(total_size == (__u32) 0) {
+ /* If read_capacity returns all F's the logical is >2TB */
+ /* so we switch to 16-byte CDBs for all read/write ops */
+ if(total_size == 0xFFFFFFFF) {
cciss_read_capacity_16(cntl_num, i, 0,
&total_size, &block_size);
hba[cntl_num]->cciss_read = CCISS_READ_16;


2007-02-22 03:17:51

by Andrew Morton

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Wed, 21 Feb 2007 15:10:39 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:

> Patch 1/2
>
> This patch changes the way we determine if a logical volume is larger than 2TB. The
> original test looked for a total_size of 0. Originally we added 1 to the total_size.
> That would make our read_capacity return size 0 for >2TB lv's. We assumed that we
> could not have a lv size of 0 so it seemed OK until we were in a clustered system. The
> backup node would see a size of 0 due to the reservation on the drive. That caused
> the driver to switch to 16-byte CDB's which are not supported on older controllers.
> After that everything was broken.
> It may seem petty but I don't see the value in trying to determine if the LBA is
> beyond the 2TB boundary. That's why when we switch we use 16-byte CDB's for all
> read/write operations.
> Please consider this for inclusion.
>
> ...
>
> + if (total_size == 0xFFFFFFFF) {

I seem to remember having already questioned this. total_size is sector_t, which
can be either 32-bit or 64-bit. Are you sure that comparison works as
intended in both cases?


> + if(total_size == 0xFFFFFFFF) {
> cciss_read_capacity_16(cntl_num, i, 0,
> &total_size, &block_size);
> hba[cntl_num]->cciss_read = CCISS_READ_16;

Here too.

2007-02-22 07:32:10

by Eric Dumazet

Subject: [PATCH] Speedup divides by cpu_power in scheduler

--- linux-2.6.21-rc1/include/linux/sched.h 2007-02-21 21:08:32.000000000 +0100
+++ linux-2.6.21-rc1-ed/include/linux/sched.h 2007-02-22 08:53:26.000000000 +0100
@@ -669,7 +669,12 @@ struct sched_group {
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU. This is read only (except for setup, hotplug CPU).
*/
- unsigned long cpu_power;
+ unsigned int cpu_power;
+ /*
+ * reciprocal value of cpu_power to avoid expensive divides
+ * (see include/linux/reciprocal_div.h)
+ */
+ u32 reciprocal_cpu_power;
};

struct sched_domain {
--- linux-2.6.21-rc1/kernel/sched.c.orig 2007-02-21 21:10:54.000000000 +0100
+++ linux-2.6.21-rc1-ed/kernel/sched.c 2007-02-22 08:46:56.000000000 +0100
@@ -52,6 +52,7 @@
#include <linux/tsacct_kern.h>
#include <linux/kprobes.h>
#include <linux/delayacct.h>
+#include <linux/reciprocal_div.h>
#include <asm/tlb.h>

#include <asm/unistd.h>
@@ -182,6 +183,26 @@ static unsigned int static_prio_timeslic
}

/*
+ * Divide a load by a sched group cpu_power : (load / sg->cpu_power)
+ * Since cpu_power is a 'constant', we can use a reciprocal divide.
+ */
+static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
+{
+ return reciprocal_divide(load, sg->reciprocal_cpu_power);
+}
+/*
+ * Each time a sched group cpu_power is changed,
+ * we must compute its reciprocal value
+ */
+static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
+{
+ sg->cpu_power += val;
+ BUG_ON(sg->cpu_power == 0);
+ sg->reciprocal_cpu_power = reciprocal_value(sg->cpu_power);
+}
+
+
+/*
* task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
* to time slice values: [800ms ... 100ms ... 5ms]
*
@@ -1241,7 +1262,8 @@ find_idlest_group(struct sched_domain *s
}

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ avg_load = sg_div_cpu_power(group,
+ avg_load * SCHED_LOAD_SCALE);

if (local_group) {
this_load = avg_load;
@@ -2355,7 +2377,8 @@ find_busiest_group(struct sched_domain *
total_pwr += group->cpu_power;

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ avg_load = sg_div_cpu_power(group,
+ avg_load * SCHED_LOAD_SCALE);

group_capacity = group->cpu_power / SCHED_LOAD_SCALE;

@@ -2510,8 +2533,8 @@ small_imbalance:
pwr_now /= SCHED_LOAD_SCALE;

/* Amount of load we'd subtract */
- tmp = busiest_load_per_task * SCHED_LOAD_SCALE /
- busiest->cpu_power;
+ tmp = sg_div_cpu_power(busiest,
+ busiest_load_per_task * SCHED_LOAD_SCALE);
if (max_load > tmp)
pwr_move += busiest->cpu_power *
min(busiest_load_per_task, max_load - tmp);
@@ -2519,10 +2542,11 @@ small_imbalance:
/* Amount of load we'd add */
if (max_load * busiest->cpu_power <
busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = max_load * busiest->cpu_power / this->cpu_power;
+ tmp = sg_div_cpu_power(this,
+ max_load * busiest->cpu_power);
else
- tmp = busiest_load_per_task * SCHED_LOAD_SCALE /
- this->cpu_power;
+ tmp = sg_div_cpu_power(this,
+ busiest_load_per_task * SCHED_LOAD_SCALE);
pwr_move += this->cpu_power *
min(this_load_per_task, this_load + tmp);
pwr_move /= SCHED_LOAD_SCALE;
@@ -6352,7 +6376,7 @@ next_sg:
continue;
}

- sg->cpu_power += sd->groups->cpu_power;
+ sg_inc_cpu_power(sg, sd->groups->cpu_power);
}
sg = sg->next;
if (sg != group_head)
@@ -6427,6 +6451,8 @@ static void init_sched_groups_power(int

child = sd->child;

+ sd->groups->cpu_power = 0;
+
/*
* For perf policy, if the groups in child domain share resources
* (for example cores sharing some portions of the cache hierarchy
@@ -6437,18 +6463,16 @@ static void init_sched_groups_power(int
if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
(child->flags &
(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
return;
}

- sd->groups->cpu_power = 0;
-
/*
* add cpu_power of each child group to this groups cpu_power
*/
group = child->groups;
do {
- sd->groups->cpu_power += group->cpu_power;
+ sg_inc_cpu_power(sd->groups, group->cpu_power);
group = group->next;
} while (group != child->groups);
}


Attachments:
sched_use_reciprocal_divide.patch (4.26 kB)

2007-02-22 08:01:34

by Ingo Molnar

Subject: Re: [PATCH] Speedup divides by cpu_power in scheduler


* Eric Dumazet <[email protected]> wrote:

> I noticed expensive divides done in try_to_wake_up() and
> find_busiest_group() on a dual dual-core Opteron machine (four cores
> total), moderately loaded (15,000 context switches per second)
>
> oprofile numbers :

nice patch! Ack for -mm testing:

Acked-by: Ingo Molnar <[email protected]>

one general suggestion: could you rename ->cpu_power to ->__cpu_power?
That makes it perfectly clear that this field's semantics have changed
and that it should never be manipulated directly without also changing
->reciprocal_cpu_power, and will also flag any out of tree code
trivially.

> + * Divide a load by a sched group cpu_power : (load / sg->cpu_power)
> + * Since cpu_power is a 'constant', we can use a reciprocal divide.
> + */
> +static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
> +{
> + return reciprocal_divide(load, sg->reciprocal_cpu_power);
> +}
> +/*
> + * Each time a sched group cpu_power is changed,
> + * we must compute its reciprocal value
> + */
> +static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
> +{
> + sg->cpu_power += val;
> + BUG_ON(sg->cpu_power == 0);
> + sg->reciprocal_cpu_power = reciprocal_value(sg->cpu_power);
> +}

Could you remove the BUG_ON() - it will most likely cause the
non-inlining of these functions if CONFIG_CC_OPTIMIZE_FOR_SIZE=y and
CONFIG_FORCED_INLINING is disabled (which is a popular combination in
distro kernels, it reduces the kernel's size by over 30%). And it's not
like we'll be able to overlook a divide by zero crash in
reciprocal_value() anyway, if cpu_power were to be zero ;-)

Ingo

2007-02-22 08:19:59

by Eric Dumazet

Subject: [PATCH, take 2] Speedup divides by cpu_power in scheduler

--- linux-2.6.21-rc1/include/linux/sched.h 2007-02-21 21:08:32.000000000 +0100
+++ linux-2.6.21-rc1-ed/include/linux/sched.h 2007-02-22 10:12:00.000000000 +0100
@@ -668,8 +668,14 @@ struct sched_group {
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU. This is read only (except for setup, hotplug CPU).
+ * Note : Never change cpu_power without recompute its reciprocal
*/
- unsigned long cpu_power;
+ unsigned int __cpu_power;
+ /*
+ * reciprocal value of cpu_power to avoid expensive divides
+ * (see include/linux/reciprocal_div.h)
+ */
+ u32 reciprocal_cpu_power;
};

struct sched_domain {
--- linux-2.6.21-rc1/kernel/sched.c 2007-02-21 21:10:54.000000000 +0100
+++ linux-2.6.21-rc1-ed/kernel/sched.c 2007-02-22 10:12:00.000000000 +0100
@@ -52,6 +52,7 @@
#include <linux/tsacct_kern.h>
#include <linux/kprobes.h>
#include <linux/delayacct.h>
+#include <linux/reciprocal_div.h>
#include <asm/tlb.h>

#include <asm/unistd.h>
@@ -182,6 +183,25 @@ static unsigned int static_prio_timeslic
}

/*
+ * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
+ * Since cpu_power is a 'constant', we can use a reciprocal divide.
+ */
+static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
+{
+ return reciprocal_divide(load, sg->reciprocal_cpu_power);
+}
+/*
+ * Each time a sched group cpu_power is changed,
+ * we must compute its reciprocal value
+ */
+static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
+{
+ sg->__cpu_power += val;
+ sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
+}
+
+
+/*
* task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
* to time slice values: [800ms ... 100ms ... 5ms]
*
@@ -1241,7 +1261,8 @@ find_idlest_group(struct sched_domain *s
}

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ avg_load = sg_div_cpu_power(group,
+ avg_load * SCHED_LOAD_SCALE);

if (local_group) {
this_load = avg_load;
@@ -2352,12 +2373,13 @@ find_busiest_group(struct sched_domain *
}

total_load += avg_load;
- total_pwr += group->cpu_power;
+ total_pwr += group->__cpu_power;

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ avg_load = sg_div_cpu_power(group,
+ avg_load * SCHED_LOAD_SCALE);

- group_capacity = group->cpu_power / SCHED_LOAD_SCALE;
+ group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;

if (local_group) {
this_load = avg_load;
@@ -2468,8 +2490,8 @@ group_next:
max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);

/* How much load to actually move to equalise the imbalance */
- *imbalance = min(max_pull * busiest->cpu_power,
- (avg_load - this_load) * this->cpu_power)
+ *imbalance = min(max_pull * busiest->__cpu_power,
+ (avg_load - this_load) * this->__cpu_power)
/ SCHED_LOAD_SCALE;

/*
@@ -2503,27 +2525,28 @@ small_imbalance:
* moving them.
*/

- pwr_now += busiest->cpu_power *
+ pwr_now += busiest->__cpu_power *
min(busiest_load_per_task, max_load);
- pwr_now += this->cpu_power *
+ pwr_now += this->__cpu_power *
min(this_load_per_task, this_load);
pwr_now /= SCHED_LOAD_SCALE;

/* Amount of load we'd subtract */
- tmp = busiest_load_per_task * SCHED_LOAD_SCALE /
- busiest->cpu_power;
+ tmp = sg_div_cpu_power(busiest,
+ busiest_load_per_task * SCHED_LOAD_SCALE);
if (max_load > tmp)
- pwr_move += busiest->cpu_power *
+ pwr_move += busiest->__cpu_power *
min(busiest_load_per_task, max_load - tmp);

/* Amount of load we'd add */
- if (max_load * busiest->cpu_power <
+ if (max_load * busiest->__cpu_power <
busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = max_load * busiest->cpu_power / this->cpu_power;
+ tmp = sg_div_cpu_power(this,
+ max_load * busiest->__cpu_power);
else
- tmp = busiest_load_per_task * SCHED_LOAD_SCALE /
- this->cpu_power;
- pwr_move += this->cpu_power *
+ tmp = sg_div_cpu_power(this,
+ busiest_load_per_task * SCHED_LOAD_SCALE);
+ pwr_move += this->__cpu_power *
min(this_load_per_task, this_load + tmp);
pwr_move /= SCHED_LOAD_SCALE;

@@ -5486,7 +5509,7 @@ static void sched_domain_debug(struct sc
break;
}

- if (!group->cpu_power) {
+ if (!group->__cpu_power) {
printk("\n");
printk(KERN_ERR "ERROR: domain->cpu_power not "
"set\n");
@@ -5663,7 +5686,7 @@ init_sched_build_groups(cpumask_t span,
continue;

sg->cpumask = CPU_MASK_NONE;
- sg->cpu_power = 0;
+ sg->__cpu_power = 0;

for_each_cpu_mask(j, span) {
if (group_fn(j, cpu_map, NULL) != group)
@@ -6352,7 +6375,7 @@ next_sg:
continue;
}

- sg->cpu_power += sd->groups->cpu_power;
+ sg_inc_cpu_power(sg, sd->groups->__cpu_power);
}
sg = sg->next;
if (sg != group_head)
@@ -6427,6 +6450,8 @@ static void init_sched_groups_power(int

child = sd->child;

+ sd->groups->__cpu_power = 0;
+
/*
* For perf policy, if the groups in child domain share resources
* (for example cores sharing some portions of the cache hierarchy
@@ -6437,18 +6462,16 @@ static void init_sched_groups_power(int
if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
(child->flags &
(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
return;
}

- sd->groups->cpu_power = 0;
-
/*
* add cpu_power of each child group to this groups cpu_power
*/
group = child->groups;
do {
- sd->groups->cpu_power += group->cpu_power;
+ sg_inc_cpu_power(sd->groups, group->__cpu_power);
group = group->next;
} while (group != child->groups);
}
@@ -6608,7 +6631,7 @@ static int build_sched_domains(const cpu
sd = &per_cpu(node_domains, j);
sd->groups = sg;
}
- sg->cpu_power = 0;
+ sg->__cpu_power = 0;
sg->cpumask = nodemask;
sg->next = sg;
cpus_or(covered, covered, nodemask);
@@ -6636,7 +6659,7 @@ static int build_sched_domains(const cpu
"Can not alloc domain group for node %d\n", j);
goto error;
}
- sg->cpu_power = 0;
+ sg->__cpu_power = 0;
sg->cpumask = tmp;
sg->next = prev->next;
cpus_or(covered, covered, tmp);


Attachments:
sched_use_reciprocal_divide.patch (6.17 kB)

2007-02-22 08:23:54

by Ingo Molnar

Subject: Re: [PATCH, take 2] Speedup divides by cpu_power in scheduler


* Eric Dumazet <[email protected]> wrote:

> Ingo suggested to rename cpu_power to __cpu_power to make clear it
> should not be modified without changing its reciprocal value too.

thanks,

Acked-by: Ingo Molnar <[email protected]>

> I did not convert the divide in cpu_avg_load_per_task(), because
> tracking nr_running changes may be not worth it ? We could use a
> static table of 32 reciprocal values but it would add a conditional
> branch and table lookup.

not worth it i think. Lets wait for it to show up in an oprofile? (if
ever)

Ingo

2007-02-22 16:51:27

by Mike Miller (OS Dev)

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Wed, Feb 21, 2007 at 07:14:27PM -0800, Andrew Morton wrote:
> On Wed, 21 Feb 2007 15:10:39 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
>
> > Patch 1/2
> >
> > This patch changes the way we determine if a logical volume is larger than 2TB. The
> > original test looked for a total_size of 0. Originally we added 1 to the total_size.
> > That would make our read_capacity return size 0 for >2TB lv's. We assumed that we
> > could not have a lv size of 0 so it seemed OK until we were in a clustered system. The
> > backup node would see a size of 0 due to the reservation on the drive. That caused
> > the driver to switch to 16-byte CDB's which are not supported on older controllers.
> > After that everything was broken.
> > It may seem petty but I don't see the value in trying to determine if the LBA is
> > beyond the 2TB boundary. That's why when we switch we use 16-byte CDB's for all
> > read/write operations.
> > Please consider this for inclusion.
> >
> > ...
> >
> > + if (total_size == 0xFFFFFFFF) {
>
> I seem to remember having already questioned this. total_size is sector_t, which
> can be either 32-bit or 64-bit. Are you sure that comparison works as
> intended in both cases?
>
>
> > + if(total_size == 0xFFFFFFFF) {
> > cciss_read_capacity_16(cntl_num, i, 0,
> > &total_size, &block_size);
> > hba[cntl_num]->cciss_read = CCISS_READ_16;
>
> Here too.
It has worked in all of the configs I've tested. Should I change it from sector_t to a
__u64? I have not tested all possible configs.

-- mikem

2007-02-22 20:18:12

by Mike Miller (OS Dev)

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Wed, Feb 21, 2007 at 07:14:27PM -0800, Andrew Morton wrote:
> >
> > + if (total_size == 0xFFFFFFFF) {
>
> I seem to remember having already questioned this. total_size is sector_t, which
> can be either 32-bit or 64-bit. Are you sure that comparison works as
> intended in both cases?
>
>
> > + if(total_size == 0xFFFFFFFF) {
> > cciss_read_capacity_16(cntl_num, i, 0,
> > &total_size, &block_size);
> > hba[cntl_num]->cciss_read = CCISS_READ_16;
>
> Here too.

Andrew,
Using this test program and changing the type of x to int, long, long long signed and
unsigned the comparison always worked on x86, x86_64, and ia64. It looks to me like
the comparsion will always do what we expect. Unless you see some other problem.


#include <stdio.h>

int main(void)
{
	unsigned long long x;

	x = 0x00000000ffffffff;

	printf(sizeof(x) == 8 ?
	       "x = %llu, sizeof(x) = %zu\n" :
	       "x = %lu, sizeof(x) = %zu\n", x, sizeof(x));
	if (x == 0xffffffff)
		printf("equal\n");
	else
		printf("not equal\n");

	return 0;
}

-- mikem

2007-02-22 21:28:09

by Andrew Morton

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

> On Thu, 22 Feb 2007 10:51:23 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> On Wed, Feb 21, 2007 at 07:14:27PM -0800, Andrew Morton wrote:
> > On Wed, 21 Feb 2007 15:10:39 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> >
> > > Patch 1/2
> > > + if (total_size == 0xFFFFFFFF) {
> >
> > I seem to remember having already questioned this. total_size is sector_t, which
> > can be either 32-bit or 64-bit. Are you sure that comparison works as
> > intended in both cases?
> >
> >
> > > + if(total_size == 0xFFFFFFFF) {
> > > cciss_read_capacity_16(cntl_num, i, 0,
> > > &total_size, &block_size);
> > > hba[cntl_num]->cciss_read = CCISS_READ_16;
> >
> > Here too.
> It has worked in all of the configs I've tested. Should I change it from sector_t to a
> __64? I have not tested all possible configs.
>

I'd suggest using -1: that just works.

2007-02-22 21:39:08

by Mike Miller

Subject: RE: [Patch 1/2] cciss: fix for 2TB support



> -----Original Message-----
> From: Mike Miller (OS Dev) [mailto:[email protected]]
>
> Andrew,
> Using this test program and changing the type of x to int,
> long, long long signed and unsigned the comparison always
> worked on x86, x86_64, and ia64. It looks to me like the
> comparsion will always do what we expect. Unless you see some
> other problem.
>
>
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
> unsigned long long x;
>
> x = 0x00000000ffffffff;
>
> printf(sizeof(x) == 8 ?
> "x = %lld, sizeof(x) = %d\n" :
> "x = %ld, sizeof(x) = %d\n", x, sizeof(x));
> if (x == 0xffffffff)
> printf("equal\n");
> else
> printf("not equal\n");
>
> }
>
> -- mikem
>
BTW: I also changed x to be 8 f's, 16 f's, and 8 zeros followed by 8 f's as shown.

2007-02-22 21:42:41

by James Bottomley

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Thu, 2007-02-22 at 13:24 -0800, Andrew Morton wrote:
> > On Thu, 22 Feb 2007 10:51:23 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> > On Wed, Feb 21, 2007 at 07:14:27PM -0800, Andrew Morton wrote:
> > > On Wed, 21 Feb 2007 15:10:39 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> > >
> > > > Patch 1/2
> > > > + if (total_size == 0xFFFFFFFF) {
> > >
> > > I seem to remember having already questioned this. total_size is sector_t, which
> > > can be either 32-bit or 64-bit. Are you sure that comparison works as
> > > intended in both cases?
> > >
> > >
> > > > + if(total_size == 0xFFFFFFFF) {
> > > > cciss_read_capacity_16(cntl_num, i, 0,
> > > > &total_size, &block_size);
> > > > hba[cntl_num]->cciss_read = CCISS_READ_16;
> > >
> > > Here too.
> > It has worked in all of the configs I've tested. Should I change it from sector_t to a
> > __64? I have not tested all possible configs.
> >
>
> I'd suggest using -1: that just works.

Actually, no, that won't work.

This is a SCSI heuristic for determining when to use the 16 byte version
of the read capacity command. The 10 byte command can only return 32
bits of information (this is in sectors, so it returns up to 2TB of
bytes).

The heuristic requirement is that if the size is exactly 0xffffffff then
you should try the 16 byte command (which can return 64 bits of
information). If that fails, you assume the 0xffffffff is a real
size; otherwise, you assume it was truncated and take the real result
from the 16 byte command.

You can see a far more elaborate version of this in operation in
sd.c:sd_read_capacity().

The only thing I'd suggest is to use 0xFFFFFFFFULL as the constant to
prevent sign extension issues.

James


2007-02-22 22:02:43

by Mike Miller (OS Dev)

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Thu, Feb 22, 2007 at 03:41:24PM -0600, James Bottomley wrote:
> On Thu, 2007-02-22 at 13:24 -0800, Andrew Morton wrote:
> > > On Thu, 22 Feb 2007 10:51:23 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> > > On Wed, Feb 21, 2007 at 07:14:27PM -0800, Andrew Morton wrote:
> > > > On Wed, 21 Feb 2007 15:10:39 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> > > >
> > > > > Patch 1/2
> > > > > + if (total_size == 0xFFFFFFFF) {
> > > >
> > > > I seem to remember having already questioned this. total_size is sector_t, which
> > > > can be either 32-bit or 64-bit. Are you sure that comparison works as
> > > > intended in both cases?
> > > >
> > > >
> > > > > + if(total_size == 0xFFFFFFFF) {
> > > > > cciss_read_capacity_16(cntl_num, i, 0,
> > > > > &total_size, &block_size);
> > > > > hba[cntl_num]->cciss_read = CCISS_READ_16;
> > > >
> > > > Here too.
> > > It has worked in all of the configs I've tested. Should I change it from sector_t to a
> > > __64? I have not tested all possible configs.
> > >
> >
> > I'd suggest using -1: that just works.
>
> Actually, no, that won't work.
>
> This is a SCSI heuristic for determining when to use the 16 byte version
> of the read capacity command. The 10 byte command can only return 32
> bits of information (this is in sectors, so it returns up to 2TB of
> bytes).
>
> The heuristic requirement is that if the size is exactly 0xffffffff then
> you should try the 16 byte command (which can return 64 bits of
> information). If that fails then you assume the 0xfffffff is a real
> size otherwize, you assume it was truncated and take the real result
> from the 16 byte command.
>
> You can see a far more elaborate version of this in operation in
> sd.c:sd_read_capacity().
>
> The only thing I'd suggest is to use 0xFFFFFFFFULL as the constant to
> prevent sign extension issues.
>
> James
>
>
Will this patch for my patch work for now?


diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 1abf1f5..a1f1d9f 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1303,7 +1303,7 @@ static void cciss_update_drive_info(int

/* if read_capacity returns all F's this volume is >2TB in size */
/* so we switch to 16-byte CDB's for all read/write ops */
- if (total_size == 0xFFFFFFFF) {
+ if (total_size == 0xFFFFFFFFULL) {
cciss_read_capacity_16(ctlr, drv_index, 1,
&total_size, &block_size);
h->cciss_read = CCISS_READ_16;
@@ -3129,7 +3129,7 @@ #endif /* CCISS_DEBUG */

/* If read_capacity returns all F's the logical is >2TB */
/* so we switch to 16-byte CDBs for all read/write ops */
- if(total_size == 0xFFFFFFFF) {
+ if(total_size == 0xFFFFFFFFULL) {
cciss_read_capacity_16(cntl_num, i, 0,
&total_size, &block_size);
hba[cntl_num]->cciss_read = CCISS_READ_16;

2007-02-22 22:07:56

by James Bottomley

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Thu, 2007-02-22 at 16:02 -0600, Mike Miller (OS Dev) wrote:
> Will this patch for my patch work for now?

Yes, I think that should be fine ... it's only a theoretical worry; at
the moment sector_t is unsigned ... but just in case.

James


2007-02-23 20:52:32

by Mike Miller (OS Dev)

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

On Thu, Feb 22, 2007 at 04:06:41PM -0600, James Bottomley wrote:
> On Thu, 2007-02-22 at 16:02 -0600, Mike Miller (OS Dev) wrote:
> > Will this patch for my patch work for now?
>
> Yes, I think that should be fine ... it's only a theoretical worry; at
> the moment sector_t is unsigned ... but just in case.
>
> James
>
>
Andrew,
Are you waiting for a new patch from me? Or is my patch's patch sufficient?

-- mikem

2007-02-24 06:39:45

by Andrew Morton

Subject: Re: [Patch 1/2] cciss: fix for 2TB support

> On Fri, 23 Feb 2007 14:52:29 -0600 "Mike Miller (OS Dev)" <[email protected]> wrote:
> On Thu, Feb 22, 2007 at 04:06:41PM -0600, James Bottomley wrote:
> > On Thu, 2007-02-22 at 16:02 -0600, Mike Miller (OS Dev) wrote:
> > > Will this patch for my patch work for now?
> >
> > Yes, I think that should be fine ... it's only a theoretical worry; at
> > the moment sector_t is unsigned ... but just in case.
> >
> > James
> >
> >
> Andrew,
> Are you waiting for a new patch from me? Or is my patch's patch sufficient?
>

It looked OK. But I'm travelling at present, will get back into things
late next week.