Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Sun, 18 Aug 2002 03:11:18 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Sun, 18 Aug 2002 03:11:17 -0400
Received: from mail14.speakeasy.net ([216.254.0.214]:31430 "EHLO mail.speakeasy.net") by vger.kernel.org with ESMTP id ; Sun, 18 Aug 2002 03:11:16 -0400
Subject: Re: cerberus errors on 2.4.19 (ide dma related)
From: Ed Sweetman
To: linux-kernel@vger.kernel.org
In-Reply-To: <1029653085.674.53.camel@psuedomode>
References: <1029653085.674.53.camel@psuedomode>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
X-Mailer: Ximian Evolution 1.0.8
Date: 18 Aug 2002 03:15:16 -0400
Message-Id: <1029654916.2037.0.camel@psuedomode>
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3851
Lines: 82

On Sun, 2002-08-18 at 02:44, Ed Sweetman wrote:
> (overview written in hindsight, after writing the email)
> I ran all these tests on ide/host2/bus0/target0/lun0/part1.
> Whenever DMA was enabled or disabled, it was done to both drives at the
> same time.
> I do not know whether cerberus cares where it is run from for its tests,
> but the program was on the drive being tested when it ran, and
> throughout the email I assume it runs its drive tests primarily on
> the partition it was run from. I see now that this is probably wrong,
> and instead of changing where I run the test I should alternate which
> drive gets DMA enabled and disabled; process of elimination will then
> show just what kind of DMA bug I'm seeing.
> (/overview)
>
>
> I've been trying to track down why I keep getting disk corruption on my
> hard drives after a good amount of usage. It's been
> happening for a long time across a number of different kernel versions.
> I believe this is because I stick to the same board manufacturer, Abit,
> and use VIA chipsets.
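The process of elimination described in the overview (alternate which drive has DMA enabled, then see which setting tracks the failures) can be sketched roughly as below. This is only an illustration: the drive labels and both function names are made up for the example, and it assumes each cerberus run is recorded simply as "failed" or "did not fail".

```python
from itertools import product

# Hypothetical labels for the two drives discussed in this thread.
DRIVES = ("promise_master", "onboard_master")


def dma_test_matrix(drives=DRIVES):
    """Return every DMA on/off combination across the drives, one dict per run."""
    return [
        dict(zip(drives, flags))
        for flags in product((False, True), repeat=len(drives))
    ]


def suspects(results, drives=DRIVES):
    """Return the drives whose DMA setting matches the failures exactly.

    `results` is a list of (config, failed) pairs, where `config` is one
    dict from dma_test_matrix() and `failed` is True if that run errored
    or panicked. A drive is a suspect only if at least one run failed and
    its DMA was on in every failing run and off in every passing run.
    """
    if not any(failed for _, failed in results):
        return []
    return [
        drive
        for drive in drives
        if all(config[drive] == failed for config, failed in results)
    ]


if __name__ == "__main__":
    # Print the four runs that would need to be performed.
    for config in dma_test_matrix():
        print(config)
```

An empty result from suspects() after all four runs would suggest the failures do not follow a single drive's DMA setting, which is the "not controller specific" case.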
>
> I ran cerberus with DMA enabled at UDMA4 and at UDMA2. At UDMA4 cerberus
> reports MEMORY errors and BBidehost2bus0target0lun0discN1 errors, but
> mostly MEMORY errors, before the kernel panics after a minute or two. At
> UDMA2 cerberus reports no errors, but the kernel still panics after a
> minute or two. I ran cerberus a couple of times on each; with UDMA4 it
> began to report MEMORY errors about 30 seconds into the test.
>
> I thought, well, this could be RAM errors, so I ran memtest for a couple
> of hours. Nothing was reported as being bad. I then thought my hardware
> could be the problem, so I ran e2fsck -c on the partition I was running
> cerberus on, with DMA disabled via hdparm -d0, and it completed with no
> errors found. I then rebooted, enabled UDMA2, and the kernel panicked
> with the same test after a few minutes.
>
> The rest of this email is just information regarding the setup.
>
>
> First off, my filesystems are set up as follows:
>
> swap + files are now all on my primary master IDE drive on the
> motherboard IDE controller. Swap on my primary master Promise controller
> seemed too problematic because of corruption, but I'm not sure if the
> corruption I've seen is related only to the Promise controller or if
> it's not controller specific. I'll have to run the test without swap on
> the Promise drive and then run the test on my primary motherboard hdd,
> and again without swap.
>
> cerberus version : 1.3.0pre4
> dmesg info : http://signal-lost.homeip.net/lkml/dmesg
> hdparm info : http://signal-lost.homeip.net/lkml/hdparm
> pci info : http://signal-lost.homeip.net/lkml/lspci
>
> tests completed before escaping in pio mode:
> http://signal-lost.homeip.net/lkml/tests_passed
>
> Errors during last test that caused kernel panic (udma2)
> http://signal-lost.homeip.net/lkml/memory
>
> Errors during test of udma4 (first test)
> http://signal-lost.homeip.net/lkml/memory2
> http://signal-lost.homeip.net/lkml/dmesg2
> various segfaults of badblocks of BBidehost tests.
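For reference, the three transfer settings used in these tests (PIO via hdparm -d0, UDMA2 and UDMA4) correspond to hdparm flags roughly as in the sketch below. Only -d0 appears in the original message; the -X66 and -X68 values are added here and follow hdparm's convention of 64+n for UltraDMA mode n. The device path /dev/hdX is a placeholder and the helper name is invented for illustration. Nothing is executed: the helper only builds the argument list.

```python
# Dry-run helper: builds hdparm argument lists but never runs them.
# hdparm's -X option takes 64+n for UltraDMA mode n, so UDMA2 is 66
# and UDMA4 is 68. -d0 / -d1 turn the using_dma flag off / on.
MODES = {
    "pio": ["-d0"],
    "udma2": ["-d1", "-X66"],
    "udma4": ["-d1", "-X68"],
}


def hdparm_args(device, mode):
    """Return the hdparm command line (as a list) for one test mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["hdparm", *MODES[mode], device]


if __name__ == "__main__":
    # /dev/hdX is a placeholder, not a device named in this thread.
    for mode in MODES:
        print(" ".join(hdparm_args("/dev/hdX", mode)))
```

Keeping this as a dry run is deliberate, since changing the transfer mode on a live drive is exactly the operation under suspicion here.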
>
>
> I ran memtest for an extensive amount of time after the first test
> reported memory errors and got absolutely no errors (I wasn't using DMA
> mode at the time either). And since these errors aren't produced when
> not using DMA on my drives, I find it very unlikely that "System
> Ram" is the cause of them. I'm going to rerun the test on my
> motherboard primary drive after posting this, in case something happens
> and I hose everything.

Forgot to add my kernel config.
http://signal-lost.homeip.net/lkml/config

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/