From: =?UTF-8?Q?Maciej_=C5=BBenczykowski?= <maze@google.com>
Subject: Re: NULL pointer dereference in ext4_ext_remove_space on 3.5.1
Date: Thu, 16 Aug 2012 14:40:53 -0700
Message-ID: <CANP3RGd62=voh5T6NACyFE-NqX=Huk1hkewSakPs67vC+uuTuw@mail.gmail.com>
References: <CABRT9RAOhaxcYdCxMn5neJ9WT85r=h=7WgZ2dmLaOs-MMqDW9A@mail.gmail.com>
	<20120816024654.GB3781@thunk.org>
	<20120816111051.GA16036@localhost>
	<20120816152513.GA31346@thunk.org>
	<CANP3RGdMfM0tZbaJS9dFetRhXoGtyS0Nx4hZER_Qv0a061X_dQ@mail.gmail.com>
	<20120816211948.GF31346@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
To: "Theodore Ts'o" <tytso@mit.edu>,
	=?UTF-8?Q?Maciej_=C5=BBenczykowski?= <maze@google.com>,
	Fengguang Wu <fengguang.wu@intel.com>,
	Marti Raudsepp <marti@juffo.org>,
	Kernel hackers <linux-kernel@vger.kernel.org>,
	ext4 hackers <linux-ext4@vger.kernel.org>
In-Reply-To: <20120816211948.GF31346@thunk.org>
Sender: linux-ext4-owner@vger.kernel.org

> Maciej, you weren't able to reliably repro the crash were you?  I'm
> pretty sure this should fix the crash, but it would be really great to
> confirm things.
>
> I suspect creating a file system with a really small journal may make
> it easier to reproduce, but I haven't had time to try create a
> reliable repro for this bug yet.

This happened twice to me while moving data off of a ~1TB ext4 partition.
The data portion was on a striped RAID across 2 ~500GB drives, the
journal was on a relatively large partition (500MB?) on an SSD.
(crypto and LVM were also involved).
I've since emptied the partition and even deleted the RAID array.

Both times it happened during rm, first time rm -rf of a directory
tree, second time during rm of a 250GB disk image generated by dd
(from a notebook drive).
Both rm's were manually run by me from a shell command line, and there
was pretty much nothing else happening on the machine at the time.

I'm not aware of there having been anything interesting (like:
holes/punch/sparseness, much r/w activity in the middle of files, etc)
on this filesystem, it was pretty much just a write-once data backup
that I had copied elsewhere and was deleting.  The 250GB disk image
was definitely just a sequentially written disk dump, and I think the
same thing holds true for the contents of the wiped directory tree
(although in many much smaller files).

I know i=1 in both cases (and disassembly pointed out the location
where the above debug patch is BUGing), but I don't think it's
possible to figure out what inode # it crashed on.

Perhaps just untarring a bunch of kernels onto an empty partition,
filling it up, then deleting those kernels should be sufficient to
repro this (untried).

Perhaps something like:
  create 1TB filesystem
  untar a thousand kernel source trees onto it
  create 20GB files of junk until it is full
  rm -rf everything on that filesystem
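
A scaled-down sketch of the fill-then-delete steps above, in Python, in
case it helps anyone script the attempt.  Like the steps themselves it is
untested as a repro.  The function name, the parameters and all the sizes
are placeholders of mine; small directories of small files stand in for
the kernel trees; and it does not create the filesystem, it assumes the
target directory is the mount point of a scratch ext4 filesystem.  It only
removes what it created itself.

```python
import os
import shutil
import tempfile


def fill_then_delete(target, n_trees=3, files_per_tree=20,
                     small_size=4096, big_size=1 << 20, n_big=2):
    """Write many small files and a few large sequential ones, then delete them.

    Returns (files_written, bytes_written).  All defaults are tiny
    placeholders, nowhere near the ~1TB / 20GB figures in the thread.
    """
    junk = os.urandom(64 * 1024)  # reused buffer of junk data
    created = []                  # top-level paths this run created
    files = 0
    written = 0

    def write_file(path, size):
        # Plain sequential write from offset 0: no holes, no seeking.
        with open(path, "wb") as f:
            left = size
            while left > 0:
                n = min(left, len(junk))
                f.write(junk[:n])
                left -= n
        return size

    # Stand-in for "untar kernel source trees": many small files.
    for t in range(n_trees):
        tree = os.path.join(target, "tree%04d" % t)
        os.makedirs(tree)
        created.append(tree)
        for i in range(files_per_tree):
            written += write_file(os.path.join(tree, "f%05d" % i), small_size)
            files += 1

    # Stand-in for "20GB files of junk": a few large files.
    for b in range(n_big):
        path = os.path.join(target, "big%02d.img" % b)
        written += write_file(path, big_size)
        created.append(path)
        files += 1

    # The delete phase, which is where both crashes happened.
    for path in created:
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

    return files, written


if __name__ == "__main__":
    # Safe demo in a temporary directory; a real attempt would pass the
    # mount point of the scratch filesystem and much larger sizes.
    with tempfile.TemporaryDirectory() as d:
        print(fill_then_delete(d))
```

For a real attempt the counts and sizes would have to be raised until the
filesystem is close to full, which this sketch does not detect on its own.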


- Maciej