2009-07-20 04:10:28

by Greg KH

[permalink] [raw]
Subject: Linux 2.6.27.27

I'm announcing the release of the 2.6.27.26 kernel. All users of the
2.6.27 kernel series are very strongly encouraged to upgrade.

I'll also be replying to this message with a copy of the patch between
2.6.27.26 and 2.6.27.27

The updated 2.6.27.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
and can be browsed at the normal kernel.org git web browser:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary

thanks,

greg k-h

------------

Makefile | 7 ++-
drivers/block/floppy.c | 5 ++
drivers/md/dm.c | 4 --
drivers/net/tulip/interrupt.c | 84 +++++++++++++++++++++++++++---------------
drivers/net/tulip/tulip.h | 32 +++++++++++++++-
drivers/pci/iova.c | 26 +++++++++++--
include/linux/mm.h | 2 -
include/linux/personality.h | 5 ++
include/linux/security.h | 2 +
kernel/resource.c | 2 -
kernel/sysctl.c | 2 -
mm/Kconfig | 18 +++++++++
mm/mmap.c | 3 +
security/Kconfig | 22 -----------
security/security.c | 3 -
15 files changed, 145 insertions(+), 72 deletions(-)

Christoph Lameter (1):
security: use mmap_min_addr indepedently of security models

David Woodhouse (1):
Fix iommu address space allocation

Eugene Teo (1):
Add '-fno-delete-null-pointer-checks' to gcc CFLAGS

Greg Kroah-Hartman (2):
Revert "dm: sysfs skip output when device is being destroyed"
Linux 2.6.27.27

Jiri Slaby (1):
floppy: fix lock imbalance

Julien Tinnes (1):
personality: fix PER_CLEAR_ON_SETID (CVE-2009-1895)

Linus Torvalds (1):
Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x

Tomasz Lemiech (1):
tulip: Fix for MTU problems with 802.1q tagged frames

Zhang Rui (1):
kernel/resource.c: fix sign extension in reserve_setup()


2009-07-20 04:10:41

by Greg KH

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

diff --git a/Makefile b/Makefile
index 90764ee..387a5fd 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 27
-EXTRAVERSION = .26
+EXTRAVERSION = .27
NAME = Trembling Tortoise

# *DOCUMENTATION*
@@ -340,7 +340,8 @@ KBUILD_CPPFLAGS := -D__KERNEL__ $(LINUXINCLUDE)

KBUILD_CFLAGS := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
-fno-strict-aliasing -fno-common \
- -Werror-implicit-function-declaration
+ -Werror-implicit-function-declaration \
+ -fno-delete-null-pointer-checks
KBUILD_AFLAGS := -D__ASSEMBLY__

# Read KERNELRELEASE from include/config/kernel.release (if it exists)
@@ -556,7 +557,7 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)

# disable invalid "can't wrap" optimzations for signed / pointers
-KBUILD_CFLAGS += $(call cc-option,-fwrapv)
+KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)

# Add user supplied CPPFLAGS, AFLAGS and CFLAGS as the last assignments
# But warn user when we do so
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 615fcd3..5900f76 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3320,7 +3320,10 @@ static inline int set_geometry(unsigned int cmd, struct floppy_struct *g,
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
mutex_lock(&open_lock);
- LOCK_FDC(drive, 1);
+ if (lock_fdc(drive, 1)) {
+ mutex_unlock(&open_lock);
+ return -EINTR;
+ }
floppy_type[type] = *g;
floppy_type[type].name = "user format";
for (cnt = type << 2; cnt < (type << 2) + 4; cnt++)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 925efaf..ace998c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -265,10 +265,6 @@ static int dm_blk_open(struct inode *inode, struct file *file)
goto out;
}

- if (test_bit(DMF_FREEING, &md->flags) ||
- test_bit(DMF_DELETING, &md->flags))
- return NULL;
-
dm_get(md);
atomic_inc(&md->open_count);

diff --git a/drivers/net/tulip/interrupt.c b/drivers/net/tulip/interrupt.c
index c6bad98..7faf84f 100644
--- a/drivers/net/tulip/interrupt.c
+++ b/drivers/net/tulip/interrupt.c
@@ -140,6 +140,7 @@ int tulip_poll(struct napi_struct *napi, int budget)
/* If we own the next entry, it is a new packet. Send it up. */
while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) {
s32 status = le32_to_cpu(tp->rx_ring[entry].status);
+ short pkt_len;

if (tp->dirty_rx + RX_RING_SIZE == tp->cur_rx)
break;
@@ -151,8 +152,28 @@ int tulip_poll(struct napi_struct *napi, int budget)
if (++work_done >= budget)
goto not_done;

- if ((status & 0x38008300) != 0x0300) {
- if ((status & 0x38000300) != 0x0300) {
+ /*
+ * Omit the four octet CRC from the length.
+ * (May not be considered valid until we have
+ * checked status for RxLengthOver2047 bits)
+ */
+ pkt_len = ((status >> 16) & 0x7ff) - 4;
+
+ /*
+ * Maximum pkt_len is 1518 (1514 + vlan header)
+ * Anything higher than this is always invalid
+ * regardless of RxLengthOver2047 bits
+ */
+
+ if ((status & (RxLengthOver2047 |
+ RxDescCRCError |
+ RxDescCollisionSeen |
+ RxDescRunt |
+ RxDescDescErr |
+ RxWholePkt)) != RxWholePkt
+ || pkt_len > 1518) {
+ if ((status & (RxLengthOver2047 |
+ RxWholePkt)) != RxWholePkt) {
/* Ingore earlier buffers. */
if ((status & 0xffff) != 0x7fff) {
if (tulip_debug > 1)
@@ -161,30 +182,23 @@ int tulip_poll(struct napi_struct *napi, int budget)
dev->name, status);
tp->stats.rx_length_errors++;
}
- } else if (status & RxDescFatalErr) {
+ } else {
/* There was a fatal error. */
if (tulip_debug > 2)
printk(KERN_DEBUG "%s: Receive error, Rx status %8.8x.\n",
dev->name, status);
tp->stats.rx_errors++; /* end of a packet.*/
- if (status & 0x0890) tp->stats.rx_length_errors++;
+ if (pkt_len > 1518 ||
+ (status & RxDescRunt))
+ tp->stats.rx_length_errors++;
+
if (status & 0x0004) tp->stats.rx_frame_errors++;
if (status & 0x0002) tp->stats.rx_crc_errors++;
if (status & 0x0001) tp->stats.rx_fifo_errors++;
}
} else {
- /* Omit the four octet CRC from the length. */
- short pkt_len = ((status >> 16) & 0x7ff) - 4;
struct sk_buff *skb;

-#ifndef final_version
- if (pkt_len > 1518) {
- printk(KERN_WARNING "%s: Bogus packet size of %d (%#x).\n",
- dev->name, pkt_len, pkt_len);
- pkt_len = 1518;
- tp->stats.rx_length_errors++;
- }
-#endif
/* Check if the packet is long enough to accept without copying
to a minimally-sized skbuff. */
if (pkt_len < tulip_rx_copybreak
@@ -357,14 +371,35 @@ static int tulip_rx(struct net_device *dev)
/* If we own the next entry, it is a new packet. Send it up. */
while ( ! (tp->rx_ring[entry].status & cpu_to_le32(DescOwned))) {
s32 status = le32_to_cpu(tp->rx_ring[entry].status);
+ short pkt_len;

if (tulip_debug > 5)
printk(KERN_DEBUG "%s: In tulip_rx(), entry %d %8.8x.\n",
dev->name, entry, status);
if (--rx_work_limit < 0)
break;
- if ((status & 0x38008300) != 0x0300) {
- if ((status & 0x38000300) != 0x0300) {
+
+ /*
+ Omit the four octet CRC from the length.
+ (May not be considered valid until we have
+ checked status for RxLengthOver2047 bits)
+ */
+ pkt_len = ((status >> 16) & 0x7ff) - 4;
+ /*
+ Maximum pkt_len is 1518 (1514 + vlan header)
+ Anything higher than this is always invalid
+ regardless of RxLengthOver2047 bits
+ */
+
+ if ((status & (RxLengthOver2047 |
+ RxDescCRCError |
+ RxDescCollisionSeen |
+ RxDescRunt |
+ RxDescDescErr |
+ RxWholePkt)) != RxWholePkt
+ || pkt_len > 1518) {
+ if ((status & (RxLengthOver2047 |
+ RxWholePkt)) != RxWholePkt) {
/* Ingore earlier buffers. */
if ((status & 0xffff) != 0x7fff) {
if (tulip_debug > 1)
@@ -373,31 +408,22 @@ static int tulip_rx(struct net_device *dev)
dev->name, status);
tp->stats.rx_length_errors++;
}
- } else if (status & RxDescFatalErr) {
+ } else {
/* There was a fatal error. */
if (tulip_debug > 2)
printk(KERN_DEBUG "%s: Receive error, Rx status %8.8x.\n",
dev->name, status);
tp->stats.rx_errors++; /* end of a packet.*/
- if (status & 0x0890) tp->stats.rx_length_errors++;
+ if (pkt_len > 1518 ||
+ (status & RxDescRunt))
+ tp->stats.rx_length_errors++;
if (status & 0x0004) tp->stats.rx_frame_errors++;
if (status & 0x0002) tp->stats.rx_crc_errors++;
if (status & 0x0001) tp->stats.rx_fifo_errors++;
}
} else {
- /* Omit the four octet CRC from the length. */
- short pkt_len = ((status >> 16) & 0x7ff) - 4;
struct sk_buff *skb;

-#ifndef final_version
- if (pkt_len > 1518) {
- printk(KERN_WARNING "%s: Bogus packet size of %d (%#x).\n",
- dev->name, pkt_len, pkt_len);
- pkt_len = 1518;
- tp->stats.rx_length_errors++;
- }
-#endif
-
/* Check if the packet is long enough to accept without copying
to a minimally-sized skbuff. */
if (pkt_len < tulip_rx_copybreak
diff --git a/drivers/net/tulip/tulip.h b/drivers/net/tulip/tulip.h
index 19abbc3..0afa2d4 100644
--- a/drivers/net/tulip/tulip.h
+++ b/drivers/net/tulip/tulip.h
@@ -201,8 +201,38 @@ enum desc_status_bits {
DescStartPkt = 0x20000000,
DescEndRing = 0x02000000,
DescUseLink = 0x01000000,
- RxDescFatalErr = 0x008000,
+
+ /*
+ * Error summary flag is logical or of 'CRC Error', 'Collision Seen',
+ * 'Frame Too Long', 'Runt' and 'Descriptor Error' flags generated
+ * within tulip chip.
+ */
+ RxDescErrorSummary = 0x8000,
+ RxDescCRCError = 0x0002,
+ RxDescCollisionSeen = 0x0040,
+
+ /*
+ * 'Frame Too Long' flag is set if packet length including CRC exceeds
+ * 1518. However, a full sized VLAN tagged frame is 1522 bytes
+ * including CRC.
+ *
+ * The tulip chip does not block oversized frames, and if this flag is
+ * set on a receive descriptor it does not indicate the frame has been
+ * truncated. The receive descriptor also includes the actual length.
+ * Therefore we can safety ignore this flag and check the length
+ * ourselves.
+ */
+ RxDescFrameTooLong = 0x0080,
+ RxDescRunt = 0x0800,
+ RxDescDescErr = 0x4000,
RxWholePkt = 0x00000300,
+ /*
+ * Top three bits of 14 bit frame length (status bits 27-29) should
+ * never be set as that would make frame over 2047 bytes. The Receive
+ * Watchdog flag (bit 4) may indicate the length is over 2048 and the
+ * length field is invalid.
+ */
+ RxLengthOver2047 = 0x38000010
};


diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 3ef4ac0..078bf8b 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -1,9 +1,19 @@
/*
- * Copyright (c) 2006, Intel Corporation.
+ * Copyright ? 2006-2009, Intel Corporation.
*
- * This file is released under the GPLv2.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
*
- * Copyright (C) 2006-2008 Intel Corporation
* Author: Anil S Keshavamurthy <[email protected]>
*/

@@ -123,7 +133,15 @@ move_left:
/* Insert the new_iova into domain rbtree by holding writer lock */
/* Add new node and rebalance tree. */
{
- struct rb_node **entry = &((prev)), *parent = NULL;
+ struct rb_node **entry, *parent = NULL;
+
+ /* If we have 'prev', it's a valid place to start the
+ insertion. Otherwise, start from the root. */
+ if (prev)
+ entry = &prev;
+ else
+ entry = &iovad->rbroot.rb_node;
+
/* Figure out where to put new node */
while (*entry) {
struct iova *this = container_of(*entry,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ae9775d..eeb7e56 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -572,12 +572,10 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
*/
static inline unsigned long round_hint_to_min(unsigned long hint)
{
-#ifdef CONFIG_SECURITY
hint &= PAGE_MASK;
if (((void *)hint != NULL) &&
(hint < mmap_min_addr))
return PAGE_ALIGN(mmap_min_addr);
-#endif
return hint;
}

diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..1261208 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -40,7 +40,10 @@ enum {
* Security-relevant compatibility flags that must be
* cleared upon setuid or setgid exec:
*/
-#define PER_CLEAR_ON_SETID (READ_IMPLIES_EXEC|ADDR_NO_RANDOMIZE)
+#define PER_CLEAR_ON_SETID (READ_IMPLIES_EXEC | \
+ ADDR_NO_RANDOMIZE | \
+ ADDR_COMPAT_LAYOUT | \
+ MMAP_PAGE_ZERO)

/*
* Personality types.
diff --git a/include/linux/security.h b/include/linux/security.h
index 80c4d00..1638afd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2134,6 +2134,8 @@ static inline int security_file_mmap(struct file *file, unsigned long reqprot,
unsigned long addr,
unsigned long addr_only)
{
+ if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ return -EACCES;
return 0;
}

diff --git a/kernel/resource.c b/kernel/resource.c
index 03d796c..87f675a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -741,7 +741,7 @@ static int __init reserve_setup(char *str)
static struct resource reserve[MAXRESERVE];

for (;;) {
- int io_start, io_num;
+ unsigned int io_start, io_num;
int x = reserved;

if (get_option (&str, &io_start) != 2)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6816e6d..1228d65 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1132,7 +1132,6 @@ static struct ctl_table vm_table[] = {
.strategy = &sysctl_jiffies,
},
#endif
-#ifdef CONFIG_SECURITY
{
.ctl_name = CTL_UNNUMBERED,
.procname = "mmap_min_addr",
@@ -1141,7 +1140,6 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &proc_doulongvec_minmax,
},
-#endif
#ifdef CONFIG_NUMA
{
.ctl_name = CTL_UNNUMBERED,
diff --git a/mm/Kconfig b/mm/Kconfig
index 0bd9c2d..07b4ec4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -208,3 +208,21 @@ config VIRT_TO_BUS

config MMU_NOTIFIER
bool
+
+config DEFAULT_MMAP_MIN_ADDR
+ int "Low address space to protect from user allocation"
+ default 4096
+ help
+ This is the portion of low virtual memory which should be protected
+ from userspace allocation. Keeping a user from writing to low pages
+ can help reduce the impact of kernel NULL pointer bugs.
+
+ For most ia64, ppc64 and x86 users with lots of address space
+ a value of 65536 is reasonable and should cause no problems.
+ On arm and other archs it should not be higher than 32768.
+ Programs which use vm86 functionality would either need additional
+ permissions from either the LSM or the capabilities module or have
+ this protection disabled.
+
+ This value can be changed after boot using the
+ /proc/sys/vm/mmap_min_addr tunable.
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ae093e..d330758 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -86,6 +86,9 @@ int sysctl_overcommit_ratio = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);

+/* amount of vm to protect from userspace access */
+unsigned long mmap_min_addr = CONFIG_DEFAULT_MMAP_MIN_ADDR;
+
/*
* Check that a process has enough memory to allocate a new virtual
* mapping. 0 means there is enough memory for the allocation to
diff --git a/security/Kconfig b/security/Kconfig
index 5592939..38411dd 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -92,28 +92,8 @@ config SECURITY_ROOTPLUG

See <http://www.linuxjournal.com/article.php?sid=6279> for
more information about this module.
-
- If you are unsure how to answer this question, answer N.
-
-config SECURITY_DEFAULT_MMAP_MIN_ADDR
- int "Low address space to protect from user allocation"
- depends on SECURITY
- default 0
- help
- This is the portion of low virtual memory which should be protected
- from userspace allocation. Keeping a user from writing to low pages
- can help reduce the impact of kernel NULL pointer bugs.
-
- For most ia64, ppc64 and x86 users with lots of address space
- a value of 65536 is reasonable and should cause no problems.
- On arm and other archs it should not be higher than 32768.
- Programs which use vm86 functionality would either need additional
- permissions from either the LSM or the capabilities module or have
- this protection disabled.
-
- This value can be changed after boot using the
- /proc/sys/vm/mmap_min_addr tunable.

+ If you are unsure how to answer this question, answer N.

source security/selinux/Kconfig
source security/smack/Kconfig
diff --git a/security/security.c b/security/security.c
index 3a4b4f5..27a315d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -26,9 +26,6 @@ extern void security_fixup_ops(struct security_operations *ops);

struct security_operations *security_ops; /* Initialized to NULL */

-/* amount of vm to protect from userspace access */
-unsigned long mmap_min_addr = CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR;
-
static inline int verify(struct security_operations *ops)
{
/* verify the security_operations structure exists */

2009-07-20 12:00:09

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On 2009-07-20 06:06, Greg KH wrote:
> I'm announcing the release of the 2.6.27.26 kernel. All users of the
> 2.6.27 kernel series are very strongly encouraged to upgrade.
>
> I'll also be replying to this message with a copy of the patch between
> 2.6.27.26 and 2.6.27.27
>
> The updated 2.6.27.y git tree can be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> and can be browsed at the normal kernel.org git web browser:
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
>
> thanks,

<CUT>
> Linus Torvalds (1):
> Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x

I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4 hangs
during boot and it is caused by this fix.

I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
additional info & .config.

Best regards,

Krzysztof Ol?dzki

2009-07-20 15:23:01

by Greg KH

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
> On 2009-07-20 06:06, Greg KH wrote:
> >I'm announcing the release of the 2.6.27.26 kernel. All users of the
> >2.6.27 kernel series are very strongly encouraged to upgrade.
> >
> >I'll also be replying to this message with a copy of the patch between
> >2.6.27.26 and 2.6.27.27
> >
> >The updated 2.6.27.y git tree can be found at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> >and can be browsed at the normal kernel.org git web browser:
> > http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
> >
> >thanks,
>
> <CUT>
> >Linus Torvalds (1):
> > Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
>
> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> hangs during boot and it is caused by this fix.
>
> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> additional info & .config.

Wierd. Linus, any thoughts? Do we need to check the version of the
compiler that we are using before turning that option off?

thanks,

greg k-h

2009-07-20 16:03:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Greg KH wrote:
>
> > >Linus Torvalds (1):
> > > Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
> >
> > I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> > hangs during boot and it is caused by this fix.
> >
> > I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> > additional info & .config.
>
> Wierd. Linus, any thoughts? Do we need to check the version of the
> compiler that we are using before turning that option off?

Dang. That was what we hoped to avoid by not using -fwrapv - we didn't
know which compilers were buggy, although it _looked_ like just 4.1.x (I
have some dim memory of somebody suspecting a 4.2.x version too, but that
may have been just a "we know it should only be needed in 4.3.x, so maybe
we shouldn't use it in 4.2.x either").

But it looks like the "buggy window" for -fno-strict-overflow is
potentially even _larger_ than it ever was with -fwrapv. And I suspect
that in both cases the real problem is that almost nobody ever uses either
optimization flag, so coverage is way way smaller.

For now, I suspect we need to do something like this: go back to -fwrapv
(because it has gotten more testing), but limit it to 4.2.x and newer.

However, it would be really good to figure out _why_ it breaks. It's
probably some specific configuration issue in addition to the particular
kernel vesion (and maybe it even needs some particular hardware to show
the bug - ie a specific driver that triggers it or whatever).

Linus

---
Makefile | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 79957b3..47a3a1a 100644
--- a/Makefile
+++ b/Makefile
@@ -566,7 +566,7 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)

# disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)
+KBUILD_CFLAGS += $(call cc-ifversion, -ge, 0402, -fwrapv)

# revert to pre-gcc-4.4 behaviour of .eh_frame
KBUILD_CFLAGS += $(call cc-option,-fno-dwarf2-cfi-asm)

2009-07-20 22:09:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>
> No problem. Please let me know what should I do to help tracking this issue.

Can you build two kernels: one with -fwrapv, and one with
-fno-strict-overflow, and then verify that

- they are otherwise identical (ie exact same source code, same compiler
etc)

- verify that yes, the -fwrapv kernel works, the other does not. Just to
avoid the confusion that obviously exists with Debian/sid binutils
upgrades that _also_ happens result in nonbootable kernels.

- upload the 'vmlinux' images somewhere (I'm not sure what the limits for
binary attachments are at the kernel bugzilla, but that would be the
logical place)

In fact, it would be nice to have a third "identical" kernel build, except
with neither -fwrapv/-fno-strict-overflow.

Linus

2009-07-20 22:25:57

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Greg KH wrote:
>>
>>>> Linus Torvalds (1):
>>>> Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
>>>
>>> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
>>> hangs during boot and it is caused by this fix.
>>>
>>> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
>>> additional info & .config.
>>
>> Wierd. Linus, any thoughts? Do we need to check the version of the
>> compiler that we are using before turning that option off?
>
> Dang. That was what we hoped to avoid by not using -fwrapv - we didn't
> know which compilers were buggy, although it _looked_ like just 4.1.x (I
> have some dim memory of somebody suspecting a 4.2.x version too, but that
> may have been just a "we know it should only be needed in 4.3.x, so maybe
> we shouldn't use it in 4.2.x either").
>
> But it looks like the "buggy window" for -fno-strict-overflow is
> potentially even _larger_ than it ever was with -fwrapv. And I suspect
> that in both cases the real problem is that almost nobody ever uses either
> optimization flag, so coverage is way way smaller.
>
> For now, I suspect we need to do something like this: go back to -fwrapv
> (because it has gotten more testing), but limit it to 4.2.x and newer.
>
> However, it would be really good to figure out _why_ it breaks. It's
> probably some specific configuration issue in addition to the particular
> kernel vesion (and maybe it even needs some particular hardware to show
> the bug - ie a specific driver that triggers it or whatever).

No problem. Please let me know what should I do to help tracking this
issue.

Best regards,

Krzysztof Ol?dzki

2009-07-20 23:47:39

by Marc Dionne

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On 07/20/2009 06:08 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>> No problem. Please let me know what should I do to help tracking this issue.
>
> Can you build two kernels: one with -fwrapv, and one with
> -fno-strict-overflow, and then verify that
>
> - they are otherwise identical (ie exact same source code, same compiler
> etc)
>
> - verify that yes, the -fwrapv kernel works, the other does not. Just to
> avoid the confusion that obviously exists with Debian/sid binutils
> upgrades that _also_ happens result in nonbootable kernels.
>
> - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
> binary attachments are at the kernel bugzilla, but that would be the
> logical place)
>
> In fact, it would be nice to have a third "identical" kernel build, except
> with neither -fwrapv/-fno-strict-overflow.
>
> Linus

I might be seeing a slightly different bug, but in case it's helpful,
the behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and
binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that
come out as completely empty files (or in one case as a precisely 64K
sized file that gives a "File format not recognized" error"), and the
latest 2.6.31-rc git can't be built at all.

If I replace -fno-strict-overflow with -fwrapv in Makefile everything
builds and runs fine.

Interestingly though, "-fno-strict-overflow -v" also gives me a good
kernel, and comparing the assembly for one of the affected files doesn't
show any difference between -fwrapv and -fno-strict-overflow.

Marc

2009-07-20 23:58:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Marc Dionne wrote:
>
> I might be seeing a slightly different bug, but in case it's helpful, the
> behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and
> binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that come
> out as completely empty files (or in one case as a precisely 64K sized file
> that gives a "File format not recognized" error"), and the latest 2.6.31-rc
> git can't be built at all.

Hmm. This sounds more like the binutils bug that people had. Sounds like
an assembler bug if the *.o file ends up being empty or at some fixed
size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
since it didn't generate good assembly.

In fact, your behavior sounds like the thing that produces the *.o files
core-dumped or died for other reasons, and had a 64kB buffer that either
got flushed or not. That would explain the "zero or exactly 64kB" size.

It could be ccache too, of course.

> If I replace -fno-strict-overflow with -fwrapv in Makefile everything builds
> and runs fine.

.. and this is just really really odd. If it was the cc1 front-end that
does that with a bad optimization, I'd expect more visible turds. But on
the other hand, if it's the binutils, then I don't see why -fwrapv would
matter. Some front-end options get passed down to the assembler, but I
would definitely not expect -fwrapv/-fno-strict-overflow to be one of
those.

Linus

2009-07-21 00:37:55

by Marc Dionne

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On 07/20/2009 07:56 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>> I might be seeing a slightly different bug, but in case it's helpful, the
>> behaviour here on Fedora rawhide with gcc-4.4.0-14.x86_64 and
>> binutils-2.19.51.0.11-27.fc12.x86_64 is that I get various .o files that come
>> out as completely empty files (or in one case as a precisely 64K sized file
>> that gives a "File format not recognized" error"), and the latest 2.6.31-rc
>> git can't be built at all.
>
> Hmm. This sounds more like the binutils bug that people had. Sounds like
> an assembler bug if the *.o file ends up being empty or at some fixed
> size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
> since it didn't generate good assembly.
>
> In fact, your behavior sounds like the thing that produces the *.o files
> core-dumped or died for other reasons, and had a 64kB buffer that either
> got flushed or not. That would explain the "zero or exactly 64kB" size.
>
> It could be ccache too, of course.

Actually in my case it turns out that it is ccache after all - if I
remove it from the picture everything is fine. If I re-enable it, even
with a clean cache, I get the problem.

It might just be a coincidence that it's triggered by the -fwrapv change.

Marc

2009-07-21 01:03:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Marc Dionne wrote:
> >
> > Hmm. This sounds more like the binutils bug that people had. Sounds like
> > an assembler bug if the *.o file ends up being empty or at some fixed
> > size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
> > since it didn't generate good assembly.
> >
> > In fact, your behavior sounds like the thing that produces the *.o files
> > core-dumped or died for other reasons, and had a 64kB buffer that either
> > got flushed or not. That would explain the "zero or exactly 64kB" size.
> >
> > It could be ccache too, of course.
>
> Actually in my case it turns out that it is ccache after all - if I remove it
> from the picture everything is fine. If I re-enable it, even with a clean
> cache, I get the problem.
>
> It might just be a coincidence that it's triggered by the -fwrapv change.

Ok, so this is getting ridiculous. Do we have _three_ different kernel
issues going on at the same time, all subtly related to tools issues
rather than the kernel source tree itself?

That's just completely bizarre.

So right now we have:

- Krzysztof Oledzki: the only one who so far has really pinpointed it to
the -fwrapv change itself.

It would be good to really double-check that this is not about ccache,
since Marc apparently gets a good kernel without ccache, and -fwrapv
seems to be involved.

- Marc Dionne: ccache getting confused, with 0-byte and 64kB object
files. But why -fwrapv vs -fno-strict-overflow would matter is totally
unclear. Just happenstance? Something silly like overflowing the
length of the ccache argument buffer?

It would be wonderful to figure out what odd issue ccache might have.
The kernel command line isn't _that_ long, but it does end up being
something reasonably monstrous like

gcc -Wp,-MD,kernel/.fork.s.d -nostdinc -isystem
/usr/lib/gcc/x86_64-redhat-linux/4.4.0/include -Iinclude
-I/home/torvalds/v2.6/linux/arch/x86/include -include
include/linux/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes
-Wno-trigraphs -fno-strict-aliasing -fno-common
-Werror-implicit-function-declaration -Wno-format-security
-fno-delete-null-pointer-checks -Os -m64 -march=core2 -mno-red-zone
-mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args
-DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare
-fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
-Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer
-fno-optimize-sibling-calls -Wdeclaration-after-statement
-Wno-pointer-sign -fwrapv -fno-dwarf2-cfi-asm -D"KBUILD_STR(s)=#s"
-D"KBUILD_BASENAME=KBUILD_STR(fork)" -D"KBUILD_MODNAME=KBUILD_STR(fork)"
-fverbose-asm -S -o kernel/fork.s kernel/fork.c

so we are getting into the kilobyte range for it, and mayeb simply the
longer argument made something fail. But other build systems do even
worse things, I'm sure.

- the Debian/sid binutils package failure, solved by downgrading
binutils.

Crazy, crazy.

Linus

2009-07-21 01:07:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Marc Dionne wrote:
> >
> > It could be ccache too, of course.
>
> Actually in my case it turns out that it is ccache after all - if I remove it
> from the picture everything is fine. If I re-enable it, even with a clean
> cache, I get the problem.
>
> It might just be a coincidence that it's triggered by the -fwrapv change.

Btw, do you find any core-files lying around if you enable them before the
build with

ulimit -c unlimited

or similar?

And how did you clean ccache? There's "-c" and then there's "-C".

Linus

2009-07-21 02:38:59

by Marc Dionne

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On 07/20/2009 09:05 PM, Linus Torvalds wrote:
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>>> It could be ccache too, of course.
>> Actually in my case it turns out that it is ccache after all - if I remove it
>> from the picture everything is fine. If I re-enable it, even with a clean
>> cache, I get the problem.
>>
>> It might just be a coincidence that it's triggered by the -fwrapv change.
>
> Btw, do you find any core-files lying around if you enable them before the
> build with
>
> ulimit -c unlimited
>
> or similar?
>
> And how did you clean ccache? There's "-c" and then there's "-C".
>
> Linus

Unfortunately I'm not able to reproduce anymore, after clearing
/var/cache/ccache completely (rm -rf). Earlier I had done ccache -C,
which didn't help. I didn't think to copy the contents for more analysis.

So perhaps a combination of some odd ccache state along with changing
gcc, binutils (which I updated today) and the compile flag. Revving
binutils and gcc back and forth didn't reproduce it.

What is odd though is that when I straced a single gcc command line that
produced an empty .o file, it looked normal - a series of successful
writes with the correct amount of data to a temp file, close, unlink .o
file, rename temp file -> .o file. But the resulting file was empty.
Make me wonder if there was something filesystem/caching related to it.

Marc

2009-07-21 06:41:40

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Marc Dionne wrote:
>>>
>>> Hmm. This sounds more like the binutils bug that people had. Sounds like
>>> an assembler bug if the *.o file ends up being empty or at some fixed
>>> size. If it was cc1 that fails, I'd expect to not see an *.o file at all,
>>> since it didn't generate good assembly.
>>>
>>> In fact, your behavior sounds like the thing that produces the *.o files
>>> core-dumped or died for other reasons, and had a 64kB buffer that either
>>> got flushed or not. That would explain the "zero or exactly 64kB" size.
>>>
>>> It could be ccache too, of course.
>>
>> Actually in my case it turns out that it is ccache after all - if I remove it
>> from the picture everything is fine. If I re-enable it, even with a clean
>> cache, I get the problem.
>>
>> It might just be a coincidence that it's triggered by the -fwrapv change.
>
> Ok, so this is getting ridiculous. Do we have _three_ different kernel
> issues going on at the same time, all subtly related to tools issues
> rather than the kernel source tree itself?
>
> That's just completely bizarre.
>
> So right now we have:
>
> - Krzysztof Oledzki: the only one who so far has really pinpointed it to
> the -fwrapv change itself.
>
> It would be good to really double-check that this is not about ccache,
> since Marc apparently gets a good kernel without ccache, and -fwrapv
> seems to be involved.

There is no ccache configured in my systems and the same problem appears
on a different servers (both i386 and x86-64). However, the configs are
very similar and the hardware is nearly identical.

I'm pretty sure the only different between bootable and unbootable kernel
is that fwrapv vs strict-overflow change.

Best regards,

Krzysztof Ol?dzki

2009-07-21 07:09:53

by Krzysztof Olędzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Mon, 20 Jul 2009, Linus Torvalds wrote:

>
>
> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>>
>> No problem. Please let me know what should I do to help tracking this issue.
>
> Can you build two kernels: one with -fwrapv, and one with
> -fno-strict-overflow, and then verify that
>
> - they are otherwise identical (ie exact same source code, same compiler
> etc)
>
> - verify that yes, the -fwrapv kernel works, the other does not. Just to
> avoid the confusion that obviously exists with Debian/sid binutils
> upgrades that _also_ happens result in nonbootable kernels.
>
> - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
> binary attachments are at the kernel bugzilla, but that would be the
> logical place)
>
> In fact, it would be nice to have a third "identical" kernel build, except
> with neither -fwrapv/-fno-strict-overflow.

OK. Right now I'm building the three kernels you asked for
(fwrapv/fno-strict-overflow/none). I'll test them and upload vmlinux
images.

Best regards,

Krzysztof Ol?dzki

2009-07-21 10:17:25

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Mon, 20 Jul 2009, Linus Torvalds wrote:
>
>>
>>
>> On Mon, 20 Jul 2009, Krzysztof Oledzki wrote:
>>>
>>> No problem. Please let me know what should I do to help tracking this
>>> issue.
>>
>> Can you build two kernels: one with -fwrapv, and one with
>> -fno-strict-overflow, and then verify that
>>
>> - they are otherwise identical (ie exact same source code, same compiler
>> etc)
>>
>> - verify that yes, the -fwrapv kernel works, the other does not. Just to
>> avoid the confusion that obviously exists with Debian/sid binutils
>> upgrades that _also_ happens result in nonbootable kernels.
>>
>> - upload the 'vmlinux' images somewhere (I'm not sure what the limits for
>> binary attachments are at the kernel bugzilla, but that would be the
>> logical place)
>>
>> In fact, it would be nice to have a third "identical" kernel build, except
>> with neither -fwrapv/-fno-strict-overflow.
>
> OK. Right now I'm building the three kernels you asked for
> (fwrapv/fno-strict-overflow/none). I'll test them and upload vmlinux images.

OK, there are three kernels, exactly as you requested:

http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

# grep -A1 "can't wrap" linux-2.6.27.27-*/Makefile
linux-2.6.27.27-fno-strict-overflow/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fno-strict-overflow/Makefile-KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)
--
linux-2.6.27.27-fwrapv/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fwrapv/Makefile-KBUILD_CFLAGS += $(call cc-option,-fwrapv)
--
linux-2.6.27.27-fnone/Makefile:# disable invalid "can't wrap" optimzations for signed / pointers
linux-2.6.27.27-fnone/Makefile-#KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)

Plwase note the third one has the KBUILD_CFLAGS commented.

Kernels are identical and are compiled from the same config, on the same
server with gcc-4.2.4, binutils-2.19. There is no ccache installed and the
kernels are not patched with any additonal patches - just vanilla
linux-2.6.27.27.

Screenshot from the hanging kernel (-fno-strict-overflow):
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/2.6.27.27-hang.png

Dmesg from a bootable kernel:
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/dmesg

Best regards,

Krzysztof Ol?dzki

2009-07-21 16:13:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Krzysztof Oledzki wrote:
>
> OK, there are three kernels, exactly as you requested:
>
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

Perfect.

And interestingly, the "fno-strict-overflow" kernel is actually much
closer to the "fnone" kernel than to the "fwrapv" one. I have some silly
scripts based on 'objdump -d' plus a lot of stupid sed scripting to remove
the trivial differences due to instruction addresses, and then doing a
'diff -u' between the munged disassembly of the kernels gives me:

[torvalds@nehalem fno-strict-overflow]$ wc -l fnone-to-fno-strict-overflow fwrapv-to-fno-strict-overflow
39309 fnone-to-fno-strict-overflow
91423 fwrapv-to-fno-strict-overflow
130732 total

ie the diff from the kernel with no flags is less than twice the size of
the diff from fwrapv.

Still - it's almost 40kB of diffs of disassembly, so I'm not going to
guarantee that I can make any sense of it and find the compiler problem in
it. But I'll try. And send you test-patches to see if I can pinpoint the
code that causes the problem.

> Kernels are identical and are compiled from the same config, on the same
> server with gcc-4.2.4, binutils-2.19. There is no ccache installed and the
> kernels are not patched with any additonal patches - just vanilla
> linux-2.6.27.27.

Thank you.

> Screenshot from the hanging kernel (-fno-strict-overflow):
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/2.6.27.27-hang.png
>
> Dmesg from a bootable kernel:
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/dmesg

Great. This is all about as perfect as could be asked for. Now it's just a
question of trying to find the right code generation difference...

Linus

2009-07-21 19:17:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
> Great. This is all about as perfect as could be asked for. Now it's just a
> question of trying to find the right code generation difference...

Ok, that "just" is turning out to be really painful.

I've tried to do clever things, but the best I've been able to do is to
get the relevant differences down to about 22 thousand lines of assembler
diffs that don't match either of the working kernels. Sadly, 22KLOC of
assembler diffs isn't something anybody can reasonably read to even start
to make a guess about which lines are causing problems.

So what I'd love to do is to narrow the failure down a bit, by using
-fno-strict-overflow only on _parts_ of the tree and then try a couple of
kernels to see if they hang, to see which part it is that mis-compiles.

With a newer kernel, we could do something like this:

diff --git a/Makefile b/Makefile
index 79957b3..b096be2 100644
--- a/Makefile
+++ b/Makefile
@@ -565,9 +565,6 @@ KBUILD_CFLAGS += $(call cc-option,-Wdeclaration-after-statement,)
# disable pointer signed / unsigned warnings in gcc 4.0
KBUILD_CFLAGS += $(call cc-option,-Wno-pointer-sign,)

-# disable invalid "can't wrap" optimizations for signed / pointers
-KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)
-
# revert to pre-gcc-4.4 behaviour of .eh_frame
KBUILD_CFLAGS += $(call cc-option,-fno-dwarf2-cfi-asm)

diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1250b55 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -5,6 +5,8 @@
# Rewritten to use lists instead of if-statements.
#

+subdir-ccflags-y += -fno-strict-overflow
+
obj-y += gpio/
obj-$(CONFIG_PCI) += pci/
obj-$(CONFIG_PARISC) += parisc/

to say "use -fno-strict-overflow only when compiling objects in the
drivers/ subdirectories", but I'm pretty sure that whole clever
'subdir-ccflags-y' thing was added pretty recently, and won't work in
2.6.27

However, since there is _some_ reason to wonder about whether the problem
could be in radeonfb (because the last printouts before the hang are about
that), it would be good to test just that part.

So if you have the time and energy, it would be very interesting if you
could do something like this:

- remove the "KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)"
entirely from the main Makefile.

- one directory at a time, add

ccflags-y += -fno-strict-overflow

to the Makefile in just that particular directory, and compile and test
the kernel. Now, since your old kernel doesn't have that nifty new
"subdir-ccflags-y" thing, you can't do it for big parts of the kernel,
you can literally do it for just the contents of one subdirectory
(non-recusive!) at a time, but while there's two thousand
subdirectories in the Linux kernel sources, judicious sprinking of that
into the tree could hopefully make it possible to find.

- the first Makefile's to test would be 'drivers/video/aty/Makefile'. If
that one doesn't work, some scripting might be in order, eg something
like

for i in $(find drivers -name Makefile)
do
( echo "ccflags-y += -fno-strict-overflow" ; cat $i ) > $i.new
mv $i.new $i
done

should add it to all the subdirectories under 'drivers', etc.

and if you can find the subdirectory where '-fno-strict-overflow' makes
the difference, at that point I'd love to see the kernel image where
things worked (ie the last kernel you booted successfully _before_ the
kernel that failed) and the kernel that fails - now hopefully the
differences should be much smaller (how small will obviously depend on
whether you caught the difference in just one subdirectory or whether you
scripted it over lots and lots of subdirectories).

Of course, the tighter you can do this, the better. If it happens to be in
'drivers/video/aty/' for example, and you end up being really gung-ho
about this and want to narrow it down to not just the subdirectory, but a
few files, you could remove the per-directory "ccflags-y" line, and do a
few per-file CFLAGS entries instead, like:

CFLAGS_radeon_base.o += -fno-strict-overflow

etc.

And hey, if you think this is too much work, then you're right. It's a lot
of work. So don't worry if you can't be bothered - it would be wonderful
to try to get this thing resolved, but I do realize I'm asking a lot here.

Linus

2009-07-21 21:34:29

by Troy Moure

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:

> > Great. This is all about as perfect as could be asked for. Now it's just a
> > question of trying to find the right code generation difference...
>
> Ok, that "just" is turning out to be really painful.

I think I've found something interesting. Look at the the code generated
for edid_checksum() in driver/video/fbmon.c. This is what I see for the
-fno-strict-overflow kernel:

...
ffffffff803b37ed <edid_checksum>:
ffffffff803b37ed: 53 push %rbx
ffffffff803b37ee: 48 89 fb mov %rdi,%rbx
ffffffff803b37f1: e8 8d fd ff ff callq ffffffff803b3583 <check_edid>
ffffffff803b37f6: 85 c0 test %eax,%eax
ffffffff803b37f8: 89 c6 mov %eax,%esi
ffffffff803b37fa: 74 08 je ffffffff803b3804 <edid_checksum+0x17>
ffffffff803b37fc: 48 89 df mov %rbx,%rdi
ffffffff803b37ff: e8 c0 fe ff ff callq ffffffff803b36c4 <fix_edid>
ffffffff803b3804: eb fe jmp ffffffff803b3804 <edid_checksum+0x17>

ffffffff803b3806 <fb_parse_edid>:
ffffffff803b3806: 41 54 push %r12
ffffffff803b3808: 48 85 ff test %rdi,%rdi
...

That last insn in edid_checksum() doesn't look *quite* right to me...

The -fnone kernel has something a lot more sensible-looking:

ffffffff803b39dd <edid_checksum>:
ffffffff803b39dd: 53 push %rbx
ffffffff803b39de: 48 89 fb mov %rdi,%rbx
ffffffff803b39e1: e8 8d fd ff ff callq ffffffff803b3773 <check_$
ffffffff803b39e6: 85 c0 test %eax,%eax
ffffffff803b39e8: 89 c6 mov %eax,%esi
ffffffff803b39ea: 74 08 je ffffffff803b39f4 <edid_c$
ffffffff803b39ec: 48 89 df mov %rbx,%rdi
ffffffff803b39ef: e8 c0 fe ff ff callq ffffffff803b38b4 <fix_ed$
ffffffff803b39f4: 31 c9 xor %ecx,%ecx
ffffffff803b39f6: 31 f6 xor %esi,%esi
ffffffff803b39f8: 31 d2 xor %edx,%edx
ffffffff803b39fa: eb 0a jmp ffffffff803b3a06 <edid_c$
ffffffff803b39fc: 0f b6 c0 movzbl %al,%eax
ffffffff803b39ff: 8a 04 03 mov (%rbx,%rax,1),%al
ffffffff803b3a02: 01 c1 add %eax,%ecx
ffffffff803b3a04: 09 c6 or %eax,%esi
ffffffff803b3a06: 88 d0 mov %dl,%al
ffffffff803b3a08: ff c2 inc %edx
ffffffff803b3a0a: 81 fa 81 00 00 00 cmp $0x81,%edx
ffffffff803b3a10: 75 ea jne ffffffff803b39fc <edid_c$
ffffffff803b3a12: 84 c9 test %cl,%cl
ffffffff803b3a14: 0f 94 c0 sete %al
ffffffff803b3a17: 40 84 f6 test %sil,%sil
ffffffff803b3a1a: 5b pop %rbx
ffffffff803b3a1b: 0f 95 c2 setne %dl
ffffffff803b3a1e: 21 d0 and %edx,%eax
ffffffff803b3a20: 0f b6 c0 movzbl %al,%eax
ffffffff803b3a23: c3 retq

ffffffff803b3a24 <fb_parse_edid>:
ffffffff803b3a24: 41 54 push %r12
ffffffff803b3a26: 48 85 ff test %rdi,%rdi

Hope that helps narrow things down ;)

Troy

2009-07-22 00:55:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27


[ Added Ian Lance Taylor to the cc, he knows the background, and unlike me
is competent with gcc. ]

On Tue, 21 Jul 2009, Troy Moure wrote:
>
> I think I've found something interesting. Look at the the code generated
> for edid_checksum() in driver/video/fbmon.c. This is what I see for the
> -fno-strict-overflow kernel:

Ooh.

Bingo. You're 100% right, and you definitely found it (of course, there
may be _other_ cases like this, but that's certainly _one_ of the
problems, and probably the only one).

Just out of curiosity, how did you find it? Now that I know where to look,
it's very obvious in the assembler diffs, but I didn't notice it until you
pointed it out just because there is so _much_ of the diffs...

And yes, that's very much a compiler bug. And I also bet it's very easily
fixed.

The code in question is this loop:

#define EDID_LENGTH 128

unsigned char i, ...

for (i = 0; i < EDID_LENGTH; i++) {
csum += edid[i];
all_null |= edid[i];
}

and gcc -fno-strict-overflow has apparently decided that that is an
infinite loop, even though it clearly is not. So then the stupid and buggy
compiler will compile that loop (and the whole rest of the function) to
the "optimized" version that is just

loop:
jmp loop;

I even bet I know why: it looks at "unsigned char", and sees that it is an
8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is
a _signed_ comparison (it's signed because the C type rules mean that
'unsigned char' will be extended to 'int' in an expression), and then it
decides that in a signed comparison an 8-bit entry is always going to be
smaller than 128.

Anyway, I bet we can work around the compiler bug by just changing the
type of "i" from "unsigned char" to be a plain "int".

Krzysztof? Mind testing that?

Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the
bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear
(once you _find_ it, which was the problem) in the binaries that Krzysztof
posted. They're still at:

http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)

and you can clearly see the 'edid_checksum' miscompilation in the objdump
disassembly.

Thanks Troy,

Linus

---
Leaving in the context for Ian:

Buggy -fno-strict-overflow compilation:

> ...
> ffffffff803b37ed <edid_checksum>:
> ffffffff803b37ed: 53 push %rbx
> ffffffff803b37ee: 48 89 fb mov %rdi,%rbx
> ffffffff803b37f1: e8 8d fd ff ff callq ffffffff803b3583 <check_edid>
> ffffffff803b37f6: 85 c0 test %eax,%eax
> ffffffff803b37f8: 89 c6 mov %eax,%esi
> ffffffff803b37fa: 74 08 je ffffffff803b3804 <edid_checksum+0x17>
> ffffffff803b37fc: 48 89 df mov %rbx,%rdi
> ffffffff803b37ff: e8 c0 fe ff ff callq ffffffff803b36c4 <fix_edid>
> ffffffff803b3804: eb fe jmp ffffffff803b3804 <edid_checksum+0x17>
>
> ffffffff803b3806 <fb_parse_edid>:
> ffffffff803b3806: 41 54 push %r12
> ffffffff803b3808: 48 85 ff test %rdi,%rdi
> ...
>
> That last insn in edid_checksum() doesn't look *quite* right to me...
>
> The -fnone kernel has something a lot more sensible-looking:
>
> ffffffff803b39dd <edid_checksum>:
> ffffffff803b39dd: 53 push %rbx
> ffffffff803b39de: 48 89 fb mov %rdi,%rbx
> ffffffff803b39e1: e8 8d fd ff ff callq ffffffff803b3773 <check_$
> ffffffff803b39e6: 85 c0 test %eax,%eax
> ffffffff803b39e8: 89 c6 mov %eax,%esi
> ffffffff803b39ea: 74 08 je ffffffff803b39f4 <edid_c$
> ffffffff803b39ec: 48 89 df mov %rbx,%rdi
> ffffffff803b39ef: e8 c0 fe ff ff callq ffffffff803b38b4 <fix_ed$
> ffffffff803b39f4: 31 c9 xor %ecx,%ecx
> ffffffff803b39f6: 31 f6 xor %esi,%esi
> ffffffff803b39f8: 31 d2 xor %edx,%edx
> ffffffff803b39fa: eb 0a jmp ffffffff803b3a06 <edid_c$
> ffffffff803b39fc: 0f b6 c0 movzbl %al,%eax
> ffffffff803b39ff: 8a 04 03 mov (%rbx,%rax,1),%al
> ffffffff803b3a02: 01 c1 add %eax,%ecx
> ffffffff803b3a04: 09 c6 or %eax,%esi
> ffffffff803b3a06: 88 d0 mov %dl,%al
> ffffffff803b3a08: ff c2 inc %edx
> ffffffff803b3a0a: 81 fa 81 00 00 00 cmp $0x81,%edx
> ffffffff803b3a10: 75 ea jne ffffffff803b39fc <edid_c$
> ffffffff803b3a12: 84 c9 test %cl,%cl
> ffffffff803b3a14: 0f 94 c0 sete %al
> ffffffff803b3a17: 40 84 f6 test %sil,%sil
> ffffffff803b3a1a: 5b pop %rbx
> ffffffff803b3a1b: 0f 95 c2 setne %dl
> ffffffff803b3a1e: 21 d0 and %edx,%eax
> ffffffff803b3a20: 0f b6 c0 movzbl %al,%eax
> ffffffff803b3a23: c3 retq
>
> ffffffff803b3a24 <fb_parse_edid>:
> ffffffff803b3a24: 41 54 push %r12
> ffffffff803b3a26: 48 85 ff test %rdi,%rdi
>
> Hope that helps narrow things down ;)
>
> Troy
>

2009-07-22 01:10:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
> Just out of curiosity, how did you find it? Now that I know where to look,
> it's very obvious in the assembler diffs, but I didn't notice it until you
> pointed it out just because there is so _much_ of the diffs...

Ahh. I think I see how you found it. Looking at the diffs, there's only a
few places where the number of instructions changed by a big fraction. And
there's only _one_ place that has a factor-of-three difference (26 lines
in the correct cases, and 7 lines in the wrong one). Clever.

There's also a case in do_page_fault() where -fno-strict-overflow
generates a _lot_ more instructions than the other cases (but not by a
factor of three - but it expands 63 instructions to 100). I'm not seeing
quite _why_ it does that, but it does various stupid things like multiply
by 0x38 etc. But it doesn't look buggy, it just looks stupid.

Or did you just brute-force it and spend a lot of time eyeballing things?

Linus

2009-07-22 01:18:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
> Anyway, I bet we can work around the compiler bug by just changing the
> type of "i" from "unsigned char" to be a plain "int".

IOW, like this.

Linus

---
drivers/video/fbmon.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
index 5c1a2c0..af4a15c 100644
--- a/drivers/video/fbmon.c
+++ b/drivers/video/fbmon.c
@@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)

static int edid_checksum(unsigned char *edid)
{
- unsigned char i, csum = 0, all_null = 0;
- int err = 0, fix = check_edid(edid);
+ unsigned csum = 0, all_null = 0;
+ int i, err = 0, fix = check_edid(edid);

if (fix)
fix_edid(edid, fix);

2009-07-22 06:16:12

by Troy Moure

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:

> > Just out of curiosity, how did you find it? Now that I know where to look,
> > it's very obvious in the assembler diffs, but I didn't notice it until you
> > pointed it out just because there is so _much_ of the diffs...
>
> Ahh. I think I see how you found it. Looking at the diffs, there's only a
> few places where the number of instructions changed by a big fraction. And
> there's only _one_ place that has a factor-of-three difference (26 lines
> in the correct cases, and 7 lines in the wrong one). Clever.

Hmm..that's interesting. But no, I wasn't that clever.

I actually just started poking around the radeonfb code, since you
mentioned it looked like that might be where the issue was. The last
message printed in the hung kernel is "Monitor 2 type no found" - printed
from radeon_probe_screens(). And the first message after that in the
non-hung kernel is "Console: switching to colour frame buffer device",
which I guessed was printed from register_framebuffer() (since that calls
notifiers).

So I started looking in radeon_fb_register() between the call to
radeon_probe_screens() and the call to register_framebuffer(), and tracing
through the calls it made. I ignored pci_, sysfs_, etc. calls, thinking
the driver code was more likely to have a device probing loop or something
odd like that that could be miscompiled.

For any functions that had a loop or anything strange-looking, I checked
the assembler diffs. And after a little while (a half-hour or so, I
think), I found edid_checksum(). Just the name made me think it was a
likely culprit, even before I looked at the diff.

Obviously I got a bit lucky that problem was actually basically where I
started looking for it. But I figured even if I didn't find it, I'd learn
something about the radeonfb code. And who would pass up an opportunity to
learn about that?

Troy

2009-07-22 08:13:46

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>
> IOW, like this.
>
> Linus
>
> ---
> drivers/video/fbmon.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> index 5c1a2c0..af4a15c 100644
> --- a/drivers/video/fbmon.c
> +++ b/drivers/video/fbmon.c
> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>
> static int edid_checksum(unsigned char *edid)
> {
> - unsigned char i, csum = 0, all_null = 0;
> - int err = 0, fix = check_edid(edid);
> + unsigned csum = 0, all_null = 0;
> + int i, err = 0, fix = check_edid(edid);
>
> if (fix)
> fix_edid(edid, fix);

Wow! You guys rock! ;)

Indeed, this simple change is enough to make my kernel bootable. However,
there is still something wrong. My console is now 80x30 instead of 128x48:

-Console: switching to colour frame buffer device 128x48
+Console: switching to colour frame buffer device 80x30

So, it looks like the loop may be still miscompiled.

The kernel is here:
http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2

Best regards,


Krzysztof Ol?dzki

2009-07-22 08:33:41

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
>>
>>
>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>
>>> Anyway, I bet we can work around the compiler bug by just changing the
>>> type of "i" from "unsigned char" to be a plain "int".
>>
>> IOW, like this.
>>
>> Linus
>>
>> ---
>> drivers/video/fbmon.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>> index 5c1a2c0..af4a15c 100644
>> --- a/drivers/video/fbmon.c
>> +++ b/drivers/video/fbmon.c
>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>
>> static int edid_checksum(unsigned char *edid)
>> {
>> - unsigned char i, csum = 0, all_null = 0;
>> - int err = 0, fix = check_edid(edid);
>> + unsigned csum = 0, all_null = 0;
>> + int i, err = 0, fix = check_edid(edid);
>>
>> if (fix)
>> fix_edid(edid, fix);
>
> Wow! You guys rock! ;)
>
> Indeed, this simple change is enough to make my kernel bootable. However,
> there is still something wrong. My console is now 80x30 instead of 128x48:
>
> -Console: switching to colour frame buffer device 128x48
> +Console: switching to colour frame buffer device 80x30
>
> So, it looks like the loop may be still miscompiled.
>
> The kernel is here:
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2

OK, by adding a simple debug printk I'm now sure that the problem is
indeed in the loop:

--- linux-2.6.27.27-a/drivers/video/fbmon.c 2009-07-20 05:45:22.000000000 +0200
+++ linux-2.6.27.27-b/drivers/video/fbmon.c 2009-07-22 09:45:34.000000000 +0200
@@ -272,6 +272,8 @@
err = 1;
}

+ printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum, all_null, err);
+
return err;
}


Here is a diff between a good and a bad kernel:

-edid_checksum debug: csum=0, all_null=255, err=1
-edid_checksum debug: csum=0, all_null=255, err=1
-Console: switching to colour frame buffer device 128x48
+edid_checksum debug: csum=6400, all_null=255, err=0
+Console: switching to colour frame buffer device 80x30

In the good one the function is called twice and it returns err=1 (==OK).
In the bad kernel it returns 0 because csum!=0x00 (==6400).

Best regards,

Krzysztof Ol?dzki

2009-07-22 09:57:43

by Krzysztof Olędzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>
>>
>>
>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>
>>>
>>>
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>>
>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>> type of "i" from "unsigned char" to be a plain "int".
>>>
>>> IOW, like this.
>>>
>>> Linus
>>>
>>> ---
>>> drivers/video/fbmon.c | 4 ++--
>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>> index 5c1a2c0..af4a15c 100644
>>> --- a/drivers/video/fbmon.c
>>> +++ b/drivers/video/fbmon.c
>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>>
>>> static int edid_checksum(unsigned char *edid)
>>> {
>>> - unsigned char i, csum = 0, all_null = 0;
>>> - int err = 0, fix = check_edid(edid);
>>> + unsigned csum = 0, all_null = 0;
>>> + int i, err = 0, fix = check_edid(edid);
>>>
>>> if (fix)
>>> fix_edid(edid, fix);
>>
>> Wow! You guys rock! ;)
>>
>> Indeed, this simple change is enough to make my kernel bootable. However,
>> there is still something wrong. My console is now 80x30 instead of 128x48:
>>
>> -Console: switching to colour frame buffer device 128x48
>> +Console: switching to colour frame buffer device 80x30
>>
>> So, it looks like the loop may be still miscompiled.
>>
>> The kernel is here:
>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2
>
> OK, by adding a simple debug printk I'm now sure that the problem is indeed
> in the loop:
>
> --- linux-2.6.27.27-a/drivers/video/fbmon.c 2009-07-20 05:45:22.000000000
> +0200
> +++ linux-2.6.27.27-b/drivers/video/fbmon.c 2009-07-22 09:45:34.000000000
> +0200
> @@ -272,6 +272,8 @@
> err = 1;
> }
>
> + printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum,
> all_null, err);
> +
> return err;
> }
>
>
> Here is a diff between a good and a bad kernel:
>
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
>
> In the good one the function is called twice and it returns err=1 (==OK). In
> the bad kernel it returns 0 because csum!=0x00 (==6400).

More fun.

Actually, the problem is indepenent from -fno-strict-overflow and
also exists with -fwrapv. It is probably caused by "unsigned char i" to "int i"
translation and the variables reordering.

With "unsigned char i" before "unsigned char csum = 0":
edid_checksum loop: edid[0]=0, csum=0
edid_checksum loop: edid[1]=255, csum=0
edid_checksum loop: edid[2]=255, csum=255
-edid_checksum loop: edid[3]=255, csum=254
-edid_checksum loop: edid[4]=255, csum=253
-edid_checksum loop: edid[5]=255, csum=252
-edid_checksum loop: edid[6]=255, csum=251
-edid_checksum loop: edid[7]=0, csum=250
-edid_checksum loop: edid[8]=6, csum=250
-edid_checksum loop: edid[9]=207, csum=0
(...)
-edid_checksum debug: csum=0, all_null=255, err=1 -> OK

With "int i" after "unsigned char csum = 0":
edid_checksum loop: edid[0]=0, csum=0
edid_checksum loop: edid[1]=255, csum=0
edid_checksum loop: edid[2]=255, csum=255
+edid_checksum loop: edid[3]=255, csum=510
+edid_checksum loop: edid[4]=255, csum=765
+edid_checksum loop: edid[5]=255, csum=1020
+edid_checksum loop: edid[6]=255, csum=1275
+edid_checksum loop: edid[7]=0, csum=1530
+edid_checksum loop: edid[8]=6, csum=1530
+edid_checksum loop: edid[9]=207, csum=1536
(...)
+edid_checksum debug: csum=6400, all_null=255, err=0 -> ERROR

Funny. ;)

Best regards,

Krzysztof Ol?dzki

2009-07-22 10:23:46

by Troy Moure

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
> >
> > Indeed, this simple change is enough to make my kernel bootable. However,
> > there is still something wrong. My console is now 80x30 instead of 128x48:
> >
> > -Console: switching to colour frame buffer device 128x48
> > +Console: switching to colour frame buffer device 80x30
> >
> > So, it looks like the loop may be still miscompiled.
> >

Yes. I took a look at the -fixed binary you sent out. edid_checksum() is
now compiled to this (I added some notes on the side):

ffffffff803b37ed: <edid_checksum>:
53 push %rbx
48 89 fb mov %rdi,%rbx %ebx = edid

[... Calls to check_edid and fix_edid ... ]

31 c9 xor %ecx,%ecx csum = 0
31 f6 xor %esi,%esi all_null = 0
31 d2 xor %edx,%edx i = 0
L: 0f b6 04 1a movzbl (%rdx,%rbx,1),%eax %eax = *(i + edid)
48 ff c2 inc %rdx i++
01 c1 add %eax,%ecx csum += %eax
09 c6 or %eax,%esi all_null |= %eax
48 81 fa 80 00 00 00 cmp $0x80,%rdx
75 ec jne L if i != 80 goto L
85 c9 test %ecx,%ecx
0f 94 c0 sete %al %al == (csum == 0)
85 f6 test %esi,%esi
5b pop %rbx
0f 95 c2 setne %dl %dl == (all_null == 0)
21 d0 and %edx,%eax
0f b6 c0 movzbl %al,%eax %eax == (%al && %dl)
c3 retq return %eax

The problem is that csum is stored in %ecx (a 32-bit register) at all
times and is never truncated to a byte. In other words, the compiler is
treating csum like it's an 'int', not an 'unsigned char'.

> Here is a diff between a good and a bad kernel:
>
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
>
> In the good one the function is called twice and it returns err=1 (==OK). In
> the bad kernel it returns 0 because csum!=0x00 (==6400).

That makes sense - since csum is being treated like an 'int', it never
wraps, so it just ends up holding the total sum of all the bytes, which
apparently is 6400. Notice that 6400 % 256 == 0, so if it *had* wrapped,
it would have ended up being 0, as expected.

One "fix" might be to just make 'csum' an 'int' (since that's what the
compiler seems to think anyway :p) and do the wrapping by hand (patch
below, if you want to try this).

However, I wouldn't be surprised if other kernel functions are also being
miscompiled. It seems to me that any function that does arithmetic on
'unsigned char's and depends on the wrapping behaviour could potentially
be broken...

Best regards,

Troy

diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
index 5c1a2c0..6802b4c 100644
--- a/drivers/video/fbmon.c
+++ b/drivers/video/fbmon.c
@@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)

static int edid_checksum(unsigned char *edid)
{
- unsigned char i, csum = 0, all_null = 0;
- int err = 0, fix = check_edid(edid);
+ unsigned char all_null = 0;
+ int i, csum = 0, err = 0, fix = check_edid(edid);

if (fix)
fix_edid(edid, fix);
@@ -267,7 +267,7 @@ static int edid_checksum(unsigned char *edid)
all_null |= edid[i];
}

- if (csum == 0x00 && all_null) {
+ if ((csum & 0xff) == 0x00 && all_null) {
/* checksum passed, everything's good */
err = 1;
}

2009-07-22 10:26:40

by Troy Moure

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Jens Rosenboom wrote:

> > >> --- a/drivers/video/fbmon.c
> > >> +++ b/drivers/video/fbmon.c
> > >> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> > >>
> > >> static int edid_checksum(unsigned char *edid)
> > >> {
> > >> - unsigned char i, csum = 0, all_null = 0;
> > >> - int err = 0, fix = check_edid(edid);
> > >> + unsigned csum = 0, all_null = 0;
>
> I guess Linus shouldn't have deleted the "char" here...
>
> [...]
>

Doh! I missed this - I didn't look closely at his patch.

You can disregard my just-sent message, then...

Troy

2009-07-22 10:41:05

by Dick Streefland

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

Krzysztof Oledzki <[email protected]> wrote:
| Here is a diff between a good and a bad kernel:
|
| -edid_checksum debug: csum=0, all_null=255, err=1
| -edid_checksum debug: csum=0, all_null=255, err=1
| -Console: switching to colour frame buffer device 128x48
| +edid_checksum debug: csum=6400, all_null=255, err=0
| +Console: switching to colour frame buffer device 80x30
|
| In the good one the function is called twice and it returns err=1 (==OK).
| In the bad kernel it returns 0 because csum!=0x00 (==6400).

Linus accidentally dropped the "char" from the declaration of "csum" and
"all_null":

> - unsigned char i, csum = 0, all_null = 0;
> - int err = 0, fix = check_edid(edid);
> + unsigned csum = 0, all_null = 0;
> + int i, err = 0, fix = check_edid(edid);

--
Dick Streefland //// Altium BV
[email protected] (@ @) http://www.altium.com
--------------------------------oOO--(_)--OOo---------------------------

2009-07-22 10:45:26

by Jens Rosenboom

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On Wed, 2009-07-22 at 10:32 +0200, Krzysztof Oledzki wrote:
>
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>
> >
> >
> > On Tue, 21 Jul 2009, Linus Torvalds wrote:
> >
> >>
> >>
> >> On Tue, 21 Jul 2009, Linus Torvalds wrote:
> >>>
> >>> Anyway, I bet we can work around the compiler bug by just changing the
> >>> type of "i" from "unsigned char" to be a plain "int".
> >>
> >> IOW, like this.
> >>
> >> Linus
> >>
> >> ---
> >> drivers/video/fbmon.c | 4 ++--
> >> 1 files changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> >> index 5c1a2c0..af4a15c 100644
> >> --- a/drivers/video/fbmon.c
> >> +++ b/drivers/video/fbmon.c
> >> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> >>
> >> static int edid_checksum(unsigned char *edid)
> >> {
> >> - unsigned char i, csum = 0, all_null = 0;
> >> - int err = 0, fix = check_edid(edid);
> >> + unsigned csum = 0, all_null = 0;

I guess Linus shouldn't have deleted the "char" here...

[...]

> Here is a diff between a good and a bad kernel:
>
> -edid_checksum debug: csum=0, all_null=255, err=1
> -edid_checksum debug: csum=0, all_null=255, err=1
> -Console: switching to colour frame buffer device 128x48
> +edid_checksum debug: csum=6400, all_null=255, err=0
> +Console: switching to colour frame buffer device 80x30
>
> In the good one the function is called twice and it returns err=1 (==OK).
> In the bad kernel it returns 0 because csum!=0x00 (==6400).

6400 is a multiple of 256, so as unsigned char it is ==0.

2009-07-22 10:46:58

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>
>>
>>
>> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>>
>>>
>>>
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>
>>>>
>>>>
>>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>>>
>>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>>> type of "i" from "unsigned char" to be a plain "int".
>>>>
>>>> IOW, like this.
>>>>
>>>> Linus
>>>>
>>>> ---
>>>> drivers/video/fbmon.c | 4 ++--
>>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>>> index 5c1a2c0..af4a15c 100644
>>>> --- a/drivers/video/fbmon.c
>>>> +++ b/drivers/video/fbmon.c
>>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>>>
>>>> static int edid_checksum(unsigned char *edid)
>>>> {
>>>> - unsigned char i, csum = 0, all_null = 0;
>>>> - int err = 0, fix = check_edid(edid);
>>>> + unsigned csum = 0, all_null = 0;
>>>> + int i, err = 0, fix = check_edid(edid);
>>>>
>>>> if (fix)
>>>> fix_edid(edid, fix);
>>>
>>> Wow! You guys rock! ;)
>>>
>>> Indeed, this simple change is enough to make my kernel bootable. However,
>>> there is still something wrong. My console is now 80x30 instead of 128x48:
>>>
>>> -Console: switching to colour frame buffer device 128x48
>>> +Console: switching to colour frame buffer device 80x30
>>>
>>> So, it looks like the loop may be still miscompiled.
>>>
>>> The kernel is here:
>>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow-fixed.bz2
>>
>> OK, by adding a simple debug printk I'm now sure that the problem is indeed
>> in the loop:
>>
>> --- linux-2.6.27.27-a/drivers/video/fbmon.c 2009-07-20 05:45:22.000000000
>> +0200
>> +++ linux-2.6.27.27-b/drivers/video/fbmon.c 2009-07-22 09:45:34.000000000
>> +0200
>> @@ -272,6 +272,8 @@
>> err = 1;
>> }
>>
>> + printk("edid_checksum debug: csum=%u, all_null=%u, err=%d\n", csum,
>> all_null, err);
>> +
>> return err;
>> }
>>
>>
>> Here is a diff between a good and a bad kernel:
>>
>> -edid_checksum debug: csum=0, all_null=255, err=1
>> -edid_checksum debug: csum=0, all_null=255, err=1
>> -Console: switching to colour frame buffer device 128x48
>> +edid_checksum debug: csum=6400, all_null=255, err=0
>> +Console: switching to colour frame buffer device 80x30
>>
>> In the good one the function is called twice and it returns err=1 (==OK).
>> In the bad kernel it returns 0 because csum!=0x00 (==6400).
>
> More fun.
>
> Actually, the problem is indepenent from -fno-strict-overflow and also exists
> with -fwrapv. It is probably caused by "unsigned char i" to "int i"
> translation and the variables reordering.
>
> With "unsigned char i" before "unsigned char csum = 0":
> edid_checksum loop: edid[0]=0, csum=0
> edid_checksum loop: edid[1]=255, csum=0
> edid_checksum loop: edid[2]=255, csum=255
> -edid_checksum loop: edid[3]=255, csum=254
> -edid_checksum loop: edid[4]=255, csum=253
> -edid_checksum loop: edid[5]=255, csum=252
> -edid_checksum loop: edid[6]=255, csum=251
> -edid_checksum loop: edid[7]=0, csum=250
> -edid_checksum loop: edid[8]=6, csum=250
> -edid_checksum loop: edid[9]=207, csum=0
> (...)
> -edid_checksum debug: csum=0, all_null=255, err=1 -> OK
>
> With "int i" after "unsigned char csum = 0":
> edid_checksum loop: edid[0]=0, csum=0
> edid_checksum loop: edid[1]=255, csum=0
> edid_checksum loop: edid[2]=255, csum=255
> +edid_checksum loop: edid[3]=255, csum=510
> +edid_checksum loop: edid[4]=255, csum=765
> +edid_checksum loop: edid[5]=255, csum=1020
> +edid_checksum loop: edid[6]=255, csum=1275
> +edid_checksum loop: edid[7]=0, csum=1530
> +edid_checksum loop: edid[8]=6, csum=1530
> +edid_checksum loop: edid[9]=207, csum=1536
> (...)
> +edid_checksum debug: csum=6400, all_null=255, err=0 -> ERROR
>
> Funny. ;)

OK. I'm just blind. The oryginal patch is wrong. Instead of:

- unsigned char i, csum = 0, all_null = 0;
- int err = 0, fix = check_edid(edid);
+ unsigned csum = 0, all_null = 0;
+ int i, err = 0, fix = check_edid(edid);


It should be:

- unsigned char i, csum = 0, all_null = 0;
- int err = 0, fix = check_edid(edid);
+ unsigned char csum = 0, all_null = 0;
+ int i, err = 0, fix = check_edid(edid);

The patch should not change "unsigned char (...) csum" to "unsigned (...) csum".

Best regards,

Krzysztof Ol?dzki

2009-07-22 11:28:04

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Jens Rosenboom wrote:

> On Wed, 2009-07-22 at 10:32 +0200, Krzysztof Oledzki wrote:
>>
>> On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
>>
>>>
>>>
>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>
>>>>
>>>>
>>>> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>>>>
>>>>> Anyway, I bet we can work around the compiler bug by just changing the
>>>>> type of "i" from "unsigned char" to be a plain "int".
>>>>
>>>> IOW, like this.
>>>>
>>>> Linus
>>>>
>>>> ---
>>>> drivers/video/fbmon.c | 4 ++--
>>>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
>>>> index 5c1a2c0..af4a15c 100644
>>>> --- a/drivers/video/fbmon.c
>>>> +++ b/drivers/video/fbmon.c
>>>> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>>>>
>>>> static int edid_checksum(unsigned char *edid)
>>>> {
>>>> - unsigned char i, csum = 0, all_null = 0;
>>>> - int err = 0, fix = check_edid(edid);
>>>> + unsigned csum = 0, all_null = 0;
>
> I guess Linus shouldn't have deleted the "char" here...

Yep! However, it took me one more hour to find it. I should have read my
mails more often. ;)

Thanks.

Best regards,

Krzysztof Ol?dzki

2009-07-22 11:51:51

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
> [ Added Ian Lance Taylor to the cc, he knows the background, and unlike me
> is competent with gcc. ]
>
> On Tue, 21 Jul 2009, Troy Moure wrote:
>>
>> I think I've found something interesting. Look at the the code generated
>> for edid_checksum() in driver/video/fbmon.c. This is what I see for the
>> -fno-strict-overflow kernel:
>
> Ooh.
>
> Bingo. You're 100% right, and you definitely found it (of course, there
> may be _other_ cases like this, but that's certainly _one_ of the
> problems, and probably the only one).
>
> Just out of curiosity, how did you find it? Now that I know where to look,
> it's very obvious in the assembler diffs, but I didn't notice it until you
> pointed it out just because there is so _much_ of the diffs...
>
> And yes, that's very much a compiler bug. And I also bet it's very easily
> fixed.
>
> The code in question is this loop:
>
> #define EDID_LENGTH 128
>
> unsigned char i, ...
>
> for (i = 0; i < EDID_LENGTH; i++) {
> csum += edid[i];
> all_null |= edid[i];
> }
>
> and gcc -fno-strict-overflow has apparently decided that that is an
> infinite loop, even though it clearly is not. So then the stupid and buggy
> compiler will compile that loop (and the whole rest of the function) to
> the "optimized" version that is just
>
> loop:
> jmp loop;
>
> I even bet I know why: it looks at "unsigned char", and sees that it is an
> 8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is
> a _signed_ comparison (it's signed because the C type rules mean that
> 'unsigned char' will be extended to 'int' in an expression), and then it
> decides that in a signed comparison an 8-bit entry is always going to be
> smaller than 128.
>
> Anyway, I bet we can work around the compiler bug by just changing the
> type of "i" from "unsigned char" to be a plain "int".
>
> Krzysztof? Mind testing that?
>
> Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the
> bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear
> (once you _find_ it, which was the problem) in the binaries that Krzysztof
> posted. They're still at:
>
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2 (Hangs)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2 (OK)
> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2 (OK)
>
> and you can clearly see the 'edid_checksum' miscompilation in the objdump
> disassembly.

BTW: here is a simple testcase for this bug:

--- fno-strict-overflow-fixed-bug.c ---
#include <stdio.h>

int main() {

unsigned char i;

for (i = 0; i < 128; i++)
printf("loop %u\n", i);

return 0;
}
--- cut here ---

The code should be compiled with:
cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow fno-strict-overflow-fixed-bug.c
or:
cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow fno-strict-overflow-fixed-bug.c

This bug does not exist with -O1 or if the loop is controlled by "i < 127"
or "i < 129".

So, we should make sure there is no
unsigned char i; (...) for (i = 0; i < 128; i++)
somewhere inside the kernel.

Best regards,

Krzysztof Ol?dzki

Subject: Re: Linux 2.6.27.27

On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:
> BTW: here is a simple testcase for this bug:
>
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
>
> int main() {
>
> unsigned char i;
>
> for (i = 0; i < 128; i++)
> printf("loop %u\n", i);
>
> return 0;
> }
> --- cut here ---
>
> The code should be compiled with:
> cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow fno-strict-overflow-fixed-bug.c
> or:
> cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow fno-strict-overflow-fixed-bug.c
>
> This bug does not exist with -O1 or if the loop is controlled by "i
> < 127" or "i < 129".

Thank you.

Debian stable, ia32, gcc 4.2.4-6: BUG
Debian stable, ia32, gcc 4.3.2-1.1: no bug (default)

Debian unstable, ia32, gcc 4.2.4-6: BUG
Debian unstable, ia32, gcc 4.3.3-14: no bug (default)

Someone else already tested gcc 4.4.0, as no bug.

I will file bugs against Debian gcc-4.2...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2009-07-22 13:46:56

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>
>>
>> [ Added Ian Lance Taylor to the cc, he knows the background, and unlike me
>> is competent with gcc. ]
>>
>> On Tue, 21 Jul 2009, Troy Moure wrote:
>>>
>>> I think I've found something interesting. Look at the the code generated
>>> for edid_checksum() in driver/video/fbmon.c. This is what I see for the
>>> -fno-strict-overflow kernel:
>>
>> Ooh.
>>
>> Bingo. You're 100% right, and you definitely found it (of course, there
>> may be _other_ cases like this, but that's certainly _one_ of the
>> problems, and probably the only one).
>>
>> Just out of curiosity, how did you find it? Now that I know where to look,
>> it's very obvious in the assembler diffs, but I didn't notice it until you
>> pointed it out just because there is so _much_ of the diffs...
>>
>> And yes, that's very much a compiler bug. And I also bet it's very easily
>> fixed.
>>
>> The code in question is this loop:
>>
>> #define EDID_LENGTH 128
>>
>> unsigned char i, ...
>>
>> for (i = 0; i < EDID_LENGTH; i++) {
>> csum += edid[i];
>> all_null |= edid[i];
>> }
>>
>> and gcc -fno-strict-overflow has apparently decided that that is an
>> infinite loop, even though it clearly is not. So then the stupid and buggy
>> compiler will compile that loop (and the whole rest of the function) to
>> the "optimized" version that is just
>>
>> loop:
>> jmp loop;
>>
>> I even bet I know why: it looks at "unsigned char", and sees that it is an
>> 8-bit variable, and then it looks at "i < EDID_LENGTH" and sees that it is
>> a _signed_ comparison (it's signed because the C type rules mean that
>> 'unsigned char' will be extended to 'int' in an expression), and then it
>> decides that in a signed comparison an 8-bit entry is always going to be
>> smaller than 128.
>>
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>>
>> Krzysztof? Mind testing that?
>>
>> Ian? This is Linux 2.6.27.27 compiled with gcc-4.2.4. I'm not seeing the
>> bug in the gcc I have on my machine (gcc-4.4.0), but the bug is very clear
>> (once you _find_ it, which was the problem) in the binaries that Krzysztof
>> posted. They're still at:
>>
>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fno-strict-overflow.bz2
>> (Hangs)
>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fwrapv.bz2
>> (OK)
>> http://noc.axelspringer.pl/no-strict-overflow-vs-wrapv/vmlinux-fnone.bz2
>> (OK)
>>
>> and you can clearly see the 'edid_checksum' miscompilation in the objdump
>> disassembly.
>
> BTW: here is a simple testcase for this bug:
>
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
>
> int main() {
>
> unsigned char i;
>
> for (i = 0; i < 128; i++)
> printf("loop %u\n", i);
>
> return 0;
> }
> --- cut here ---

Or this better one (no infinite loop):

--- cut here ---
#include <stdio.h>

int main() {

unsigned char i, j=0;

for (i = 0; i <= 127; i++) {

if (!i && j++) {
printf("Buggy GCC\n");
return 1;
}
}

printf("GCC is OK\n");

return 0;
}
--- cut here ---

> The code should be compiled with:
> cc -o fno-strict-overflow-fixed-bug -Os -fno-strict-overflow
> fno-strict-overflow-fixed-bug.c
> or:
> cc -o fno-strict-overflow-fixed-bug -O2 -fno-strict-overflow
> fno-strict-overflow-fixed-bug.c
>
> This bug does not exist with -O1 or if the loop is controlled by "i < 127" or
> "i < 129".
>
> So, we should make sure there is no
> unsigned char i; (...) for (i = 0; i < 128; i++)
> somewhere inside the kernel.


Best regards,

Krzysztof Ol?dzki

2009-07-22 13:49:52

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Tue, 21 Jul 2009, Linus Torvalds wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
>>
>> Anyway, I bet we can work around the compiler bug by just changing the
>> type of "i" from "unsigned char" to be a plain "int".
>
> IOW, like this.
>
> Linus
>
> ---
> drivers/video/fbmon.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> index 5c1a2c0..af4a15c 100644
> --- a/drivers/video/fbmon.c
> +++ b/drivers/video/fbmon.c
> @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
>
> static int edid_checksum(unsigned char *edid)
> {
> - unsigned char i, csum = 0, all_null = 0;
> - int err = 0, fix = check_edid(edid);
> + unsigned csum = 0, all_null = 0;
> + int i, err = 0, fix = check_edid(edid);
>
> if (fix)
> fix_edid(edid, fix);

Tested-by: Krzysztof Piotr Oledzki <[email protected]>

On condition, that you keep "unsigned char" here. ;)

Best regards,

Krzysztof Ol?dzki

2009-07-22 15:36:35

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

Krzysztof Oledzki <[email protected]> writes:

> BTW: here is a simple testcase for this bug:
>
> --- fno-strict-overflow-fixed-bug.c ---
> #include <stdio.h>
>
> int main() {
>
> unsigned char i;
>
> for (i = 0; i < 128; i++)
> printf("loop %u\n", i);
>
> return 0;
> }
> --- cut here ---

I can confirm that this is broken in gcc 4.2 using -fno-strict-overflow,
and works correctly in gcc 4.3 and later. Sigh.

Ian

2009-07-22 15:50:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Krzysztof Oledzki wrote:

>
>
> On Tue, 21 Jul 2009, Linus Torvalds wrote:
> >
> > IOW, like this.

Yeah, I'm a moron. Not at _all_ like that. I wish I hadn't sent out the
patch, you'd have done it correctly by hand from my description.

> > diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> > index 5c1a2c0..af4a15c 100644
> > --- a/drivers/video/fbmon.c
> > +++ b/drivers/video/fbmon.c
> > @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> >
> > static int edid_checksum(unsigned char *edid)
> > {
> > - unsigned char i, csum = 0, all_null = 0;
> > - int err = 0, fix = check_edid(edid);
> > + unsigned csum = 0, all_null = 0;
> > + int i, err = 0, fix = check_edid(edid);

I don't know where the 'char' disappeared, but that was obviously not
intended. I just meant to move the 'i' from one line to the other.

> >
> > if (fix)
> > fix_edid(edid, fix);
>
> Tested-by: Krzysztof Piotr Oledzki <[email protected]>
>
> On condition, that you keep "unsigned char" here. ;)

Indeed. I'll commit the fixed version. Thanks for testing and sorry for
the idiot patch.

Linus

2009-07-22 16:00:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 22 Jul 2009, Troy Moure wrote:
>
> Obviously I got a bit lucky that problem was actually basically where I
> started looking for it. But I figured even if I didn't find it, I'd learn
> something about the radeonfb code. And who would pass up an opportunity to
> learn about that?

Who indeed? <eyeroll> There are few things more interesting than looking
at assembler code from the frame buffer layer. </eyeroll>

Linus

2009-07-23 17:34:50

by Krzysztof Olędzki

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On 2009-07-20 17:10, Greg KH wrote:
> On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
>> On 2009-07-20 06:06, Greg KH wrote:
>>> I'm announcing the release of the 2.6.27.26 kernel. All users of the
>>> 2.6.27 kernel series are very strongly encouraged to upgrade.
>>>
>>> I'll also be replying to this message with a copy of the patch between
>>> 2.6.27.26 and 2.6.27.27
>>>
>>> The updated 2.6.27.y git tree can be found at:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
>>> and can be browsed at the normal kernel.org git web browser:
>>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
>>>
>>> thanks,
>> <CUT>
>>> Linus Torvalds (1):
>>> Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
>> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
>> hangs during boot and it is caused by this fix.
>>
>> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
>> additional info & .config.
>
> Wierd. Linus, any thoughts? Do we need to check the version of the
> compiler that we are using before turning that option off?

This bug is now fixed by
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3730793d457fed79a6d49bae72996d458c8e4f2d

Could you please include this patch in the next -stable releases?

BTW: the problem exists in gcc-4.2.4 not gcc-2.4.2 so I think it is
worth to change the subject when adding the patch. ;)

Best regards,

Krzysztof Ol?dzki

2009-07-24 21:27:47

by Greg KH

[permalink] [raw]
Subject: Re: Linux 2.6.27.27

On Thu, Jul 23, 2009 at 07:33:23PM +0200, Krzysztof Olędzki wrote:
> On 2009-07-20 17:10, Greg KH wrote:
> > On Mon, Jul 20, 2009 at 01:51:33PM +0200, Krzysztof Oledzki wrote:
> >> On 2009-07-20 06:06, Greg KH wrote:
> >>> I'm announcing the release of the 2.6.27.26 kernel. All users of the
> >>> 2.6.27 kernel series are very strongly encouraged to upgrade.
> >>>
> >>> I'll also be replying to this message with a copy of the patch between
> >>> 2.6.27.26 and 2.6.27.27
> >>>
> >>> The updated 2.6.27.y git tree can be found at:
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.27.y.git
> >>> and can be browsed at the normal kernel.org git web browser:
> >>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=summary
> >>>
> >>> thanks,
> >> <CUT>
> >>> Linus Torvalds (1):
> >>> Don't use '-fwrapv' compiler option: it's buggy in gcc-4.1.x
> >> I'm very sorry to say that but 2.6.27.27 compiled with gcc-4.2.4
> >> hangs during boot and it is caused by this fix.
> >>
> >> I extended http://bugzilla.kernel.org/show_bug.cgi?id=13012 with
> >> additional info & .config.
> >
> > Wierd. Linus, any thoughts? Do we need to check the version of the
> > compiler that we are using before turning that option off?
>
> This bug is now fixed by
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3730793d457fed79a6d49bae72996d458c8e4f2d
>
> Could you please include this patch in the next -stable releases?

Yes, I have it queued up to go out today :)

> BTW: the problem exists in gcc-4.2.4 not gcc-2.4.2 so I think it is
> worth to change the subject when adding the patch. ;)

Hm, I'll see about that.

thanks,

greg k-h

2009-07-29 14:57:21

by Pavel Machek

[permalink] [raw]
Subject: Re: Linux 2.6.27.27


> > > diff --git a/drivers/video/fbmon.c b/drivers/video/fbmon.c
> > > index 5c1a2c0..af4a15c 100644
> > > --- a/drivers/video/fbmon.c
> > > +++ b/drivers/video/fbmon.c
> > > @@ -256,8 +256,8 @@ static void fix_edid(unsigned char *edid, int fix)
> > >
> > > static int edid_checksum(unsigned char *edid)
> > > {
> > > - unsigned char i, csum = 0, all_null = 0;
> > > - int err = 0, fix = check_edid(edid);
> > > + unsigned csum = 0, all_null = 0;
> > > + int i, err = 0, fix = check_edid(edid);
>
> I don't know where the 'char' disappeared, but that was obviously not
> intended. I just meant to move the 'i' from one line to the other.
>
> > >
> > > if (fix)
> > > fix_edid(edid, fix);
> >
> > Tested-by: Krzysztof Piotr Oledzki <[email protected]>
> >
> > On condition, that you keep "unsigned char" here. ;)
>
> Indeed. I'll commit the fixed version. Thanks for testing and sorry for
> the idiot patch.

So... we are going to just work around the gcc bug in the kernel?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-07-29 16:01:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 2.6.27.27



On Wed, 29 Jul 2009, Pavel Machek wrote:
>
> So... we are going to just work around the gcc bug in the kernel?

Well, the gcc people hopefully will fix it in the 4.2.4 tree too.

Also, it's not exactly the first time we work around compiler bugs. We've
done it before, I'm sure we'll do it again.

In this case, the work-around is trivial, and in many ways makes the code
more "normal" (it's just a loop counter, might as well use an "int" for
it), so there are no downsides to it.

We could disallow gcc-4.2.4 entirely, of course, and have a big "this
compiler is known to generate broken code" message and refuse to compile
the kernel with it, but while that would be a "safer" approach, it would
be rather user-unfriendly.

Compiler bugs happen. They're really really annoying, and nasty to track
down. But they aren't the end of the world, and the pattern of this
particular bug doesn't seem like it would likely trigger anywhere else.

For example, I suspect it really does need that loop induction variable to
be 'unsigned char', and it really needs that limit to be exactly 128. It
looks like a combination of loop optimization and broken range logic. I
doubt it would hit in some random code that just happens to compare an
unsigned char against 128 in general.

And using 'unsigned char' as a lop induction variable is _very_ rare.
Which is probably why the gcc bug happened in the first place - no
testing.

Linus