To: linux-kernel@vger.kernel.org
Subject: amd64 bitops fix for -Os
From: Alexandre Oliva <aoliva@redhat.com>
Organization: Red Hat OS Tools Group
Date: Sat, 29 Oct 2005 19:56:02 -0200
Message-ID: <orvezggf7x.fsf@livre.oliva.athome.lsd.ic.unicamp.br>
User-Agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.4.17 (Jumbo Shrimp, linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8729
Lines: 222

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171672

This patches fixes a bug that comes up when compiling the kernel for
x86_64 optimizing for size.  It affects 2.6.14 for sure, but I'm
pretty sure many earlier kernels are affected as well.

The symptom is that, as soon as some change is made to the root
filesystem (e.g. dmesg > /var/log/dmesg), the kernel mostly hangs.  It
was not the first time I'd run into this symptom, but this time I
could track the problem down to enabling size optimizations in the
kernel build.  It took some time to narrow down the culprit source
with a binary search, compiling part of the kernel sources with -Os
and part with -O2, but eventually it was clear that bitops itself was
to blame, which should have been clear from the soft lockup oops I
got.

The problem is that find_first_zero_bit() fails when called with an
underflown size, because its inline asm assumes at least one iteration
of scasq will run.  When this does not hold, the conditional branch
that follows it uses flags from instructions prior to the asm
statement.

When optimizing for speed, the generated code is such that the flags
will have the correct value, because of the side effects on flags of
the right shift of the size, that survive through to the asm
statement.  When optimizing for size, however, the mov instruction
used to initialize %rax with -1 is replaced with a smaller or
instruction, that modifies the flags and thus breaks the
zero-trip-count case.

Obviously the asm statement must not rely on the compiler setting up
flags by chance, so we have to either force the flags to be set
properly or make sure we run scasq at least once.  In teh
find_first_zero_bit case, this comes at pretty much no cost, since we
already test size for non-zero, but we used to do that adjusting it
from bits to words; changing it should have no visible effect on
performance.

As for find_first_bit, it's quite likely that the same bug is present
when it's called by find_next_bit in the same conditions, but
find_first_bit doesn't even test for zero.  AFAICT, it has just been
luckier, so I went ahead and added the same guard code to it.  This
unfortunately adds a test to the fast path, but I don't see how to
avoid that without auditing all callers.

I actually introduce means to guard against these cases in the public
wrappers, but the BUG_ONs are disabled by default.  I've left a kernel
running with them enabled for a bit, and they never hit, which is a
good sign, but I haven't tested it thoroughly or anything.  We could
probably do away with these new tests by modifying the find_next*bit
functions so as to not call the find_first*bit functions if they've
already exhausted the range implied by the size argument.  I'm not
sure whether that's worth doing, though, so I didn't.

While staring at the code and trying to figure out what the problem
was, I removed some needless casts from find_next_zero_bit, by
constifying the automatic pointer properly, and also moved the actual
code from find_first_zero_bit to a separate internal function, such
that we could add the bug-check to the public interface only.

I also noticed find_first_zero_bit was less efficient than
find_first_bit in that the former saved and restored rbx, because GCC
chose that to hold (addr) within the asm statement, instead of using
the readily-available and caller-saved rsi.  I've thus changed the
code to prefer rsi, although in a perfect world the compiler would be
able to figure that out by itself.

The compiler could do a bit better in find_first_zero_bit: if the
initial size turns out to be zero, it could return, like it does in
find_first_bit, but instead of sets rdx to zero and jumps to the end
of the function where rdx is copied to rax before the return
statement.  This is a negative effect of the assignment of variable
res to rdx instead of rax, which gets the register allocator to map
the pseudo register representing the return value to rdx, requiring a
copy at the end and preventing (as far as the dumb compiler can see
:-) the direct use of a return in the zero-size case.  I've verified
that this is not caused by the additional inline function that I
introduced.

I tried to change the use of registers so as to enable the better code
for this path, but I couldn't come up with anything that was as
efficient, so I figured I wouldn't try to optimize the exceptional
path in expense of the common fast path and left it alone.  If anyone
can come up with something better, please go ahead.


Anyhow, with this patch I could run 2.6.14, as in the Fedora
development tree, except for the change to optimize for size.

	Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>

--- arch/x86_64/lib/bitops.c~	2005-10-27 22:02:08.000000000 -0200
+++ arch/x86_64/lib/bitops.c	2005-10-29 18:24:27.000000000 -0200
@@ -1,5 +1,11 @@
 #include <linux/bitops.h>
 
+#define BITOPS_CHECK_UNDERFLOW_RANGE 0
+
+#if BITOPS_CHECK_UNDERFLOW_RANGE
+# include <linux/kernel.h>
+#endif
+
 #undef find_first_zero_bit
 #undef find_next_zero_bit
 #undef find_first_bit
@@ -13,11 +19,21 @@
  * Returns the bit-number of the first zero bit, not the number of the byte
  * containing a bit.
  */
-inline long find_first_zero_bit(const unsigned long * addr, unsigned long size)
+static inline long
+__find_first_zero_bit(const unsigned long * addr, unsigned long size)
 {
 	long d0, d1, d2;
 	long res;
 
+	/* We must test the size in words, not in bits, because
+	   otherwise incoming sizes in the range -63..-1 will not run
+	   any scasq instructions, and then the flags used by the je
+	   instruction will have whatever random value was in place
+	   before.  Nobody should call us like that, but
+	   find_next_zero_bit() does when offset and size are at the
+	   same word and it fails to find a zero itself.  */
+	size += 63;
+	size >>= 6;
 	if (!size)
 		return 0;
 	asm volatile(
@@ -30,11 +46,22 @@
 		"  shlq $3,%%rdi\n"
 		"  addq %%rdi,%%rdx"
 		:"=d" (res), "=&c" (d0), "=&D" (d1), "=&a" (d2)
-		:"0" (0ULL), "1" ((size + 63) >> 6), "2" (addr), "3" (-1ULL),
-		 [addr] "r" (addr) : "memory");
+		:"0" (0ULL), "1" (size), "2" (addr), "3" (-1ULL),
+		 /* Any register here would do, but GCC tends to
+		    prefer rbx over rsi, even though rsi is readily
+		    available and doesn't have to be saved.  */
+		 [addr] "S" (addr) : "memory");
 	return res;
 }
 
+long find_first_zero_bit(const unsigned long * addr, unsigned long size)
+{
+#if BITOPS_CHECK_UNDERFLOW_RANGE
+	BUG_ON (size + 63 < size);
+#endif
+	return __find_first_zero_bit (addr, size);
+}
+
 /**
  * find_next_zero_bit - find the first zero bit in a memory region
  * @addr: The address to base the search on
@@ -43,7 +70,7 @@
  */
 long find_next_zero_bit (const unsigned long * addr, long size, long offset)
 {
-	unsigned long * p = ((unsigned long *) addr) + (offset >> 6);
+	const unsigned long * p = addr + (offset >> 6);
 	unsigned long set = 0;
 	unsigned long res, bit = offset&63;
 
@@ -63,8 +90,8 @@
 	/*
 	 * No zero yet, search remaining full words for a zero
 	 */
-	res = find_first_zero_bit ((const unsigned long *)p,
-				   size - 64 * (p - (unsigned long *) addr));
+	res = __find_first_zero_bit (p, size - 64 * (p - addr));
+
 	return (offset + set + res);
 }
 
@@ -74,6 +101,17 @@
 	long d0, d1;
 	long res;
 
+	/* We must test the size in words, not in bits, because
+	   otherwise incoming sizes in the range -63..-1 will not run
+	   any scasq instructions, and then the flags used by the jz
+	   instruction will have whatever random value was in place
+	   before.  Nobody should call us like that, but
+	   find_next_bit() does when offset and size are at the same
+	   word and it fails to find a one itself.  */
+	size += 63;
+	size >>= 6;
+	if (!size)
+		return 0;
 	asm volatile(
 		"   repe; scasq\n"
 		"   jz 1f\n"
@@ -83,8 +121,7 @@
 		"   shlq $3,%%rdi\n"
 		"   addq %%rdi,%%rax"
 		:"=a" (res), "=&c" (d0), "=&D" (d1)
-		:"0" (0ULL),
-		 "1" ((size + 63) >> 6), "2" (addr),
+		:"0" (0ULL), "1" (size), "2" (addr),
 		 [addr] "r" (addr) : "memory");
 	return res;
 }
@@ -99,6 +136,9 @@
  */
 long find_first_bit(const unsigned long * addr, unsigned long size)
 {
+#if BITOPS_CHECK_UNDERFLOW_RANGE
+	BUG_ON (size + 63 < size);
+#endif
 	return __find_first_bit(addr,size);
 }
 

-- 
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/