Date: Tue, 30 Jun 2020 11:22:55 +0200
From: David Sterba
To: Sebastian Hyrwall
Cc: linux-kernel@vger.kernel.org
Subject: Re: BTRFS/EXT4 Data Corruption
Message-ID: <20200630092254.GW27795@twin.jikos.cz>
Reply-To: dsterba@suse.cz

On Mon, Jun 29, 2020 at 01:55:40AM +0700, Sebastian Hyrwall wrote:
> Sorry if this is not the right place for this email but I can't think of
> another place (might be linux-fsdevel)

You can always CC the mailing lists of the filesystems.

> Someone here is ought to be an expert in this.
>
> It all started as having file corruptions inside VMs that then led to
> alot of testing that
> resulted in replicatable results on the backend NAS.
>
> Tests where done by generating 100 1GB files from /dev/urandom to
> "volume1" (both BTRFS and EXT4 tested).
> MD5 hashing the files and then copying the files to "volume2". 2-4% of
> the files would fail the hash match every time
> the test was done.
>
> After alot of fiddling around it turned out that the problem goes away
> if doing "cp --sparse=never"
> when copying the files. This would to me exclude any hardware errors and
> feels more like something
> deeper inside the kernel.

That the problem goes away when you use a completely different way to
write the data may just be hiding the fact that the hardware is faulty.
Generating 100G of data will have a different memory usage pattern and
will likely span way more pages than the reflink approach, which is a
metadata-only operation (adding the extent references).

> The box runs Kernel 3.10.105. Version >4 seems unaffected (not 100%
> confirmed, too few testboxes).
>
> Here is a diff between a hexdump of a failed file,
>
> 43861581c43861581
> < 29d464c0: aca0 d68f 0ff4 0bad fa4M-5 1339 8148 30e8 .........E.9.H0.
> ---
> > 29d464c0: aca0 d68f 0ff4 0bad fa45 1339 8148 30e8 .........E.9.H0.
> 55989446c55989446
> < 35654c50: 31f4 f7b5 40be 2188 c539 043b 35b4 abb5 1...@.!..9.;5...
> ---
> > 35654c50: 3174 f7b5 40be 2188 c539 043b 35b4 abb5 1t..@.!..9.;5...
>
> As you can see it's a single flipped bit (31f4, 3174). I'm not sure
> about "fa4M-5". Never seen "M-" before.

If it's a bitflip, then it's faulty RAM. All other explanations, like
random memory overwrites, typically lead to corruption of whole bytes or
byte sequences. The reason for bad RAM could be a faulty module, but I've
also seen transient bitflips on a box without enough PSU power when the
system was under load.
That also makes it hard to be sure memtest will catch the errors, as
happened in my case, because the disks were not active during the test.
I'd recommend that you stop using the machine for anything other than
testing.
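
The single-bit argument can be checked mechanically. The following is only
a minimal sketch in Python, not a tool used in this thread: the function
names classify_diff and looks_like_bitflips are hypothetical, and the two
16-byte rows are copied from the second hunk of the quoted diff (the "<"
and ">" sides at 35654c50). It XORs each differing byte pair and counts
the set bits, so an isolated single-bit flip can be told apart from a
whole-byte overwrite.

```python
def classify_diff(left: bytes, right: bytes):
    """Return (offset, left_byte, right_byte, bits_differing) per differing byte."""
    if len(left) != len(right):
        raise ValueError("inputs must have equal length")
    diffs = []
    for offset, (a, b) in enumerate(zip(left, right)):
        if a != b:
            # XOR leaves a 1 in every bit position where the two bytes disagree.
            diffs.append((offset, a, b, bin(a ^ b).count("1")))
    return diffs


def looks_like_bitflips(diffs):
    """True if there is at least one difference and each one is a single bit."""
    return bool(diffs) and all(bits == 1 for _, _, _, bits in diffs)


# Rows taken from the quoted hexdump diff ("<" side and ">" side).
left = bytes.fromhex("31f4f7b540be2188c539043b35b4abb5")
right = bytes.fromhex("3174f7b540be2188c539043b35b4abb5")

diffs = classify_diff(left, right)
for offset, a, b, bits in diffs:
    print(f"offset {offset}: {a:#04x} vs {b:#04x}, xor {a ^ b:#04x}, {bits} bit(s) differ")
print("single-bit flips only:", looks_like_bitflips(diffs))
```

For the quoted row this reports one differing byte (0xf4 vs 0x74) with an
XOR of 0x80, i.e. only the top bit of that byte differs. As an aside, and
as an inference rather than something confirmed in the thread: "M-" is the
notation tools such as "cat -v" use for a byte with the high bit set, so
"M-5" would be consistent with the character "5" (0x35) in the hexdump
text itself having the same top bit flipped to give 0xb5.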