From: Martin
Subject: Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)
Date: Fri, 26 Oct 2012 22:44:41 +0200
Message-ID: <508AF639.30603@onlinehome.de>
References: <50882787.3030504@onlinehome.de> <508AEEF7.8060301@onlinehome.de> <871uglyoap.fsf@spindle.srvr.nix>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Linux Kernel Mailing List, linux-ext4@vger.kernel.org, tytso@mit.edu, stable@vger.kernel.org, gregkh@linuxfoundation.org
To: Nix
Return-path:
Received: from moutng.kundenserver.de ([212.227.17.9]:62946 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754212Ab2JZUou (ORCPT ); Fri, 26 Oct 2012 16:44:50 -0400
In-Reply-To: <871uglyoap.fsf@spindle.srvr.nix>
Sender: linux-ext4-owner@vger.kernel.org
List-ID:

On 10/26/2012 10:24 PM, Nix wrote:
> On 26 Oct 2012, Martin spake thusly:
[...]
>> I have studied my corruption problem more closely and can give you a
>> description of what happened below. Would you say this may be the same
>> bug?
>
> No. You want to keep up with the thread. Ted's first educated guess is
> not always guaranteed to be correct (though this is rare).

OK

>
>> Oct 15 19:56:12
>>
>> Computer is booted again in order to copy a few files to memory stick. Unbeknownst to me, the following entries are logged in the
>> system log:
>>
>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5): add_dirent_to_buf:1587: inode #655361: block 2629945: comm mount: bad
>> entry in directory: rec_len % 4 != 0 - offset=360(360), inode=655682, rec_len=18, name_len=5
>> Oct 15 20:00:16 harold kernel: Aborting journal on device sda5-8.
>> Oct 15 20:00:16 harold kernel: EXT4-fs (sda5): Remounting filesystem read-only
>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in ext4_evict_inode:238: Journal has aborted
>> Oct 15 20:00:16 harold kernel: EXT4-fs error (device sda5) in ext4_create:2120: IO failure
>
> That's an interesting failure, but looks slightly different to what I
> saw. No bad directory entries, no aborted journals: a replayed journal
> and subsequent corruption. Still damaged though, and after a journal
> abort I'm not surprised you had problems!

So my corrupt journal is simply the result of a user turning off the machine at a bad point in time? That's scary. In that scenario even the option data=journal wouldn't save me from harm, would it? Funny that this happens to someone who has always said that robustness is the most important quality of a filesystem (and who thinks data=writeback is madness).

>
>> I will try to rename them to their
>> proper name on another machine, and restore them on the target
>> machine. However, due to the sheer number this might take forever.
>
> I relearned this week that backups are good.

Backups are good, and always too old.

>
>> Also I am worried the problem might re-surface, as it has neither been
>> identified nor fixed.
>
> I'm seeing it on almost every reboot.

Indeed, the symptoms look different.

>
>> NB: kernel was v3.5.5
>
> Hm, this provides possible evidence that the problem does indeed extend
> into 3.5.x.
>
>> with CK1 and BFQ patches, tainted by nvidia module.
>
> It's hard to reason about a kernel that's had *that* massive lump of
> binary junk applied to it, alas. This may or may not be the same
> problem: it has some common features with what I see, but not all.
>

True, I normally re-create problems with vanilla kernels before reporting them. In this case I was cleanly sniped with no chance of re-play so far.
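
As an aside, the "rec_len % 4 != 0" complaint in the system log quoted above can be illustrated with a small sketch. It is not the kernel code, only a simplified Python model of the sanity checks ext4 applies to a directory entry's record length, assuming the classic 8-byte entry header and 4-byte padding. The function names min_rec_len and check_dirent are hypothetical, the message strings are recalled from the kernel's checker and may differ between versions, and the block-boundary and inode-range checks are left out.

```python
# Simplified model of ext4's directory-entry length checks (an assumption-laden
# sketch, not the kernel implementation).

EXT4_DIR_ROUND = 3   # entries are padded to a 4-byte boundary
DIRENT_HEADER = 8    # inode (4) + rec_len (2) + name_len (1) + file_type (1)


def min_rec_len(name_len):
    """Smallest valid record length for a name of name_len bytes."""
    return (name_len + DIRENT_HEADER + EXT4_DIR_ROUND) & ~EXT4_DIR_ROUND


def check_dirent(rec_len, name_len):
    """Return an error string for a bad entry, or None if the lengths look sane."""
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        return "rec_len is too small for name_len"
    return None


# Values taken from the logged error: rec_len=18, name_len=5
print(check_dirent(rec_len=18, name_len=5))
```

With the logged values, a 5-byte name needs at least 16 bytes, and any valid record length is a multiple of 4, so 18 cannot be a legitimate rec_len under this model; 16 or 20 would pass.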