2005-02-03 14:29:03

by Mr. Berkley Shands

Subject: 2.6.10 Kernel BUG at hugetlbpage:212 (x86_64 and i386)

On an 8GB dual cpu opteron (Tyan S2884) 2.6.10 kernel I can reproduce a crash within
several minutes by creating/mapping/deleting/unmapping 512MB files using the
code below. On an IA32 box (Xeon 2.4GHz, 3GB SuperMicro X5DA8) the crash is
fatal (no /var/log/messages) and immediate. Executables for x86_64 and i386
from either FC3 or Redhat ES3.0 available on request. GCC is 3.4.2 on both
machines and O/S releases.


Feb 2 11:28:49 noreaster kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb 2 11:28:49 noreaster kernel: Kernel BUG at hugetlbpage:212
Feb 2 11:28:49 noreaster kernel: invalid operand: 0000 [1] SMP
Feb 2 11:28:49 noreaster kernel: CPU 1
Feb 2 11:28:49 noreaster kernel: Modules linked in:
Feb 2 11:28:49 noreaster kernel: Pid: 15687, comm: DssiEPSearch Not tainted 2.6.10
Feb 2 11:28:49 noreaster kernel: RIP: 0010:[<ffffffff8011e3cb>] <ffffffff8011e3cb>{unmap_hugepage_range+75}
Feb 2 11:28:49 noreaster kernel: RSP: 0018:00000100d7701dd8 EFLAGS: 00010206
Feb 2 11:28:49 noreaster kernel: RAX: 000001012eee98e0 RBX: 000001001000a0c0 RCX: 0000000061001000
Feb 2 11:28:49 noreaster kernel: RDX: 0000000061001000 RSI: 0000000041000000 RDI: 000001012eee98e0
Feb 2 11:28:49 noreaster kernel: RBP: 0000000041000000 R08: 0000000061200000 R09: 00000100d7701f10
Feb 2 11:28:49 noreaster kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000061001000
Feb 2 11:28:49 noreaster kernel: R13: 00000101f6ee0040 R14: 0000000041000000 R15: 000001012eee98e0
Feb 2 11:28:49 noreaster kernel: FS: 0000002a96b88080(0000) GS:ffffffff80611c00(0000) knlGS:0000000000000000
Feb 2 11:28:49 noreaster kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 2 11:28:49 noreaster kernel: CR2: 0000002a9632e970 CR3: 000000001184a000 CR4: 00000000000006e0
Feb 2 11:28:49 noreaster kernel: Process DssiEPSearch (pid: 15687, threadinfo 00000100d7700000, task 00000101215c63f0)
Feb 2 11:28:49 noreaster kernel: Stack: 0000000000000206 000001001000a0c0 0000000041000000 000001012eee9838
Feb 2 11:28:49 noreaster kernel: 00000101f6ee0040 0000000061200000 000001012eee98e0 ffffffff8015d800
Feb 2 11:28:49 noreaster kernel: 0000000000000000 ffffffff802bf59b
Feb 2 11:28:49 noreaster kernel: Call Trace:<ffffffff8015d800>{unmap_vmas+320} <ffffffff802bf59b>{fpga_ioctl+2171}
Feb 2 11:28:49 noreaster kernel: <ffffffff802c141b>{fpga_read+555} <ffffffff80161a23>{do_munmap+547}
Feb 2 11:28:49 noreaster kernel: <ffffffff801627a8>{sys_munmap+72} <ffffffff8010d216>{system_call+126}
Feb 2 11:28:49 noreaster kernel:
Feb 2 11:28:49 noreaster kernel:
Feb 2 11:28:49 noreaster kernel: Code: 0f 0b 33 1f 49 80 ff ff ff ff d4 00 4c 89 f5 4d 39 e6 66 66
Feb 2 11:28:49 noreaster kernel: RIP <ffffffff8011e3cb>{unmap_hugepage_range+75} RSP <00000100d7701dd8>
Feb 2 11:32:32 noreaster kernel: Linux version 2.6.10 (root@noreaster) (gcc version 3.4.2) #47 SMP Wed Feb 2 08:01:40 CST 2005

// code to access a huge tlb filesystem

#if defined(__linux__) && defined(__x86_64__)
    HugePageFileName_ = new char[64];
    ::sprintf(HugePageFileName_, "/mnt/huge/Silo_XXXXXX");
    ::mkstemp(HugePageFileName_); // randomize this name
    Mmap_Fd_ = ::open(HugePageFileName_, O_CREAT | O_RDWR | O_LARGEFILE | O_TRUNC, 0755);
    if (Mmap_Fd_ != -1)
    {
        LONG64 MyBig = Big * sizeof(LONG64) + ((2UL * 1024UL * 1024UL) - 1);
        MyBig &= ~((2UL * 1024UL * 1024UL) - 1);
        BigBlock0_ = (LONG64 *) ::mmap(NULL, MyBig,
                                       (PROT_READ | PROT_WRITE),
                                       MAP_SHARED, Mmap_Fd_, 0);
        if (BigBlock0_ == MAP_FAILED || !BigBlock0_)
        {
            ::perror("mmap failed for huge pages");
            ::cerr << "Asked for ";
            printHRNumber(MyBig, cerr);
            ::cerr << endl;
            ::close(Mmap_Fd_);
            ::unlink(HugePageFileName_);
            delete [] HugePageFileName_;
            HugePageFileName_ = NULL;
            Mmap_Fd_ = -1;
            BigBlock0_ = NULL;
            BigBlock_ = NULL;
        }
        else
        {
            BigBlock_ = BigBlock0_;
            BigSize_ = MyBig; // remember this size
            ::unlink(HugePageFileName_); // pre-delete this file so segfaults
                                         // do not leave huge pages tied up.
        }
    }
    else
    {
        if (Debug_)
        {
            ::perror("Sorry, no HUGE pages today");
        }
    }
#endif

// snip from /etc/rc.local

#
# set up the huge file system space; pages are 2MB each
#
if [ -f /proc/fpga0 ]; then
    echo "Allocating huge file system"
    echo 1536 > /proc/sys/vm/nr_hugepages
    mount none /mnt/huge -t hugetlbfs -o size=3G,mode=0777
fi


Attachments:
huge.bug (4.40 kB)

2005-02-04 16:39:50

by Hugh Dickins

Subject: Re: 2.6.10 Kernel BUG at hugetlbpage:212 (x86_64 and i386)

On Thu, 3 Feb 2005, Mr. Berkley Shands wrote:
> Reproducible BUG on 3GB hugetlbfs filesystem for opterons and xeons with
> either
> FC3 or RedHat ES3.0 and GCC 3.4.2. Details and code snippets in attachment.
> Executables to reproduce BUG are available on request.

Patch below (against 2.6.11-rc3, applies at offset to 2.6.10) fixes
the unmap_hugepage_range BUGs I could generate: does it fix yours?

Hugh

The hugetlb_page test in do_munmap is too permissive. It checks start
vma, but forgets that end vma might be different and huge though start
is not: so hits unmap_hugepage_range BUG if misaligned end was given.

And it's too restrictive: munmap has always succeeded on unmapped areas
within its range, why should it behave differently near a hugepage vma?

And the additional checks in is_aligned_hugepage_range are irrelevant
here, when the hugepage vma already exists. But the function is still
required (on some arches), as the default for prepare_hugepage_range -
leave renaming cleanup to another occasion.

Signed-off-by: Hugh Dickins <[email protected]>

--- 2.6.11-rc3/mm/mmap.c 2005-02-03 09:06:16.000000000 +0000
+++ linux/mm/mmap.c 2005-02-04 15:40:25.000000000 +0000
@@ -1808,13 +1808,6 @@ int do_munmap(struct mm_struct *mm, unsi
 		return 0;
 	/* we have start < mpnt->vm_end */
 
-	if (is_vm_hugetlb_page(mpnt)) {
-		int ret = is_aligned_hugepage_range(start, len);
-
-		if (ret)
-			return ret;
-	}
-
 	/* if it doesn't overlap, we have nothing.. */
 	end = start + len;
 	if (mpnt->vm_start >= end)
@@ -1828,6 +1821,8 @@ int do_munmap(struct mm_struct *mm, unsi
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
 	if (start > mpnt->vm_start) {
+		if (is_vm_hugetlb_page(mpnt) && (start & ~HPAGE_MASK))
+			return -EINVAL;
 		if (split_vma(mm, mpnt, start, 0))
 			return -ENOMEM;
 		prev = mpnt;
@@ -1836,6 +1831,8 @@ int do_munmap(struct mm_struct *mm, unsi
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
+		if (is_vm_hugetlb_page(last) && (end & ~HPAGE_MASK))
+			return -EINVAL;
 		if (split_vma(mm, last, end, 1))
 			return -ENOMEM;
 	}

2005-02-04 19:49:33

by William Lee Irwin III

Subject: Re: 2.6.10 Kernel BUG at hugetlbpage:212 (x86_64 and i386)

On Fri, Feb 04, 2005 at 04:39:02PM +0000, Hugh Dickins wrote:
> Patch below (against 2.6.11-rc3, applies at offset to 2.6.10) fixes
> the unmap_hugepage_range BUGs I could generate: does it fix yours?
> The hugetlb_page test in do_munmap is too permissive. It checks start
> vma, but forgets that end vma might be different and huge though start
> is not: so hits unmap_hugepage_range BUG if misaligned end was given.
> And it's too restrictive: munmap has always succeeded on unmapped areas
> within its range, why should it behave differently near a hugepage vma?
> And the additional checks in is_aligned_hugepage_range are irrelevant
> here, when the hugepage vma already exists. But the function is still
> required (on some arches), as the default for prepare_hugepage_range -
> leave renaming cleanup to another occasion.
> Signed-off-by: Hugh Dickins <[email protected]>

As usual, excellent work. Thanks for fixing this up.

Acked-by: William Irwin <[email protected]>


-- wli

2005-02-04 20:21:19

by William Lee Irwin III

Subject: Re: 2.6.10 Kernel BUG at hugetlbpage:212 (x86_64 and i386)

On Fri, Feb 04, 2005 at 02:06:05PM -0600, Mr. Berkley Shands wrote:
> Sorry, but I still crash. This time it hung the kernel so bad I had to
> powerfail to restart.

Well, that fix is needed anyway.

On Fri, Feb 04, 2005 at 02:06:05PM -0600, Mr. Berkley Shands wrote:
> Feb 4 13:43:19 eclipse kernel: RIP: 0010:[<ffffffff8011e3cb>]
> <ffffffff8011e3cb>{unmap_hugepage_range+75}

Could you try this?


-- wli


Index: mm2-2.6.11-rc2/arch/i386/mm/hugetlbpage.c
===================================================================
--- mm2-2.6.11-rc2.orig/arch/i386/mm/hugetlbpage.c 2005-01-29 01:13:39.000000000 -0800
+++ mm2-2.6.11-rc2/arch/i386/mm/hugetlbpage.c 2005-02-04 12:05:12.000000000 -0800
@@ -209,14 +209,17 @@
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t pte;
+	pte_t pte, *ptep;
 	struct page *page;
 
 	BUG_ON(start & (HPAGE_SIZE - 1));
 	BUG_ON(end & (HPAGE_SIZE - 1));
 
 	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = ptep_get_and_clear(huge_pte_offset(mm, address));
+		ptep = huge_pte_offset(mm, address);
+		if (!ptep)
+			continue;
+		pte = ptep_get_and_clear(ptep);
 		if (pte_none(pte))
 			continue;
 		page = pte_page(pte);

2005-02-04 20:38:33

by Hugh Dickins

Subject: Re: 2.6.10 Kernel BUG at hugetlbpage:212 (x86_64 and i386)

On Fri, 4 Feb 2005, Mr. Berkley Shands wrote:
> >
> Sorry, but I still crash. This time it hung the kernel so bad I had to
> powerfail to restart.

Thanks for trying, okay, your problem is a different one from
what my patch fixes: I thought that might turn out to be so.

> Feb 4 13:43:19 eclipse kernel: Kernel BUG at hugetlbpage:212
> Feb 4 13:43:19 eclipse kernel: invalid operand: 0000 [1] SMP
> Feb 4 13:43:19 eclipse kernel: CPU 1
> Feb 4 13:43:19 eclipse kernel: Modules linked in:
> Feb 4 13:43:19 eclipse kernel: Pid: 1374, comm: DssiEPSearch Not tainted
> 2.6.10
> <ffffffff8011e3cb>{unmap_hugepage_range+75} RSP <000001007f337dd8>
>
> patch applied and rebooted (I made sure this time :-)

And kernel rebuilt, I trust!

> unless of course I'm not functional today :-)
> The patch I had was for mmap.c, not in hugetlb.c. Did I miss something?

No, that's right, it was only fixing mm/mmap.c.

Please point me privately to your app source, oh, it may be gigantic,
or difficult for me to compile, I guess your binary as well or instead:
so I can try to reproduce on i386.

We notice fpga_read and fpga_ioctl in both your traces, not in our
2.6.10 kernel source: just leftover addresses on the stack, not part
of the real backtrace, but interestingly there in both. Where are
they from (FC3?), any idea what chance it's doing something bad?

Hugh