2003-09-03 17:06:49

by Dave Olien

Subject: FYI: dbt testing on 2.6.0-test4-mm4 fails


Andrew,

I'm just mailing you this to keep you informed, Daniel McNeil and
I are investigating a failure of the dbt database workload test on
2.6.0-test4-mm4.

The failure MAY have begun as early as 2.6.0-test4. We were able
to test on test4 only after I generated a patch to raw_open() for that
kernel version. The database test4 failure LOOKS the same as the
test4-mm4 failure. But we haven't investigated it as closely there yet.
We know test3 worked OK. We may try some of the test3-mm patches to
see if something happened on one of those patches.

In the test4-mm4 case, the kernel doesn't oops or hang. Instead, the
database software detects a failure of some sort. We ran strace on
the database processes, and in one of them we see the following
output:

_llseek(38, 8192, [8192], SEEK_SET) = 0
write(38, "\0\0\0\0\4\3\1\0\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 0

A seek on file descriptor 38 to offset 8192, followed by a write of 8k.
The write returns with 0 bytes written.
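
For illustration only, here is a minimal user-space sketch of the same seek-then-write sequence with a check that treats a zero-byte write as a failure. The helper names (write_at_checked, demo_write) and the choice of EIO are assumptions made for this example; they are not taken from the database software or the kernel, and the demo runs against an ordinary scratch file rather than /dev/raw/raw1.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: seek to `offset` on `fd`, then write `len` bytes.
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_at_checked(int fd, off_t offset, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;

    while (left > 0) {
        ssize_t n = write(fd, p, left);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* real error, errno set by write() */
        }
        if (n == 0) {
            /* No progress and no errno: the case seen in the strace. */
            errno = EIO;        /* assumption: report it as an I/O error */
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return 0;
}

/*
 * Demo on an ordinary scratch file (NOT a raw device): writes one
 * zero-filled 8 KiB block at offset 8192 and returns the resulting
 * file size, or -1 on error. The file is removed afterwards.
 */
long demo_write(const char *path)
{
    static const char block[8192];
    long size = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write_at_checked(fd, 8192, block, sizeof block) == 0)
        size = (long)lseek(fd, 0, SEEK_END);
    close(fd);
    unlink(path);
    return size;
}
```

Calling demo_write() with a writable scratch path exercises the same offset and length as the strace output. On a kernel showing the failure described here, a caller using a check like this would see an explicit error instead of a silent 0.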

Immediately after this, we can see this process writing to the error
log a message indicating an error has been detected.

File descriptor 38 is for the file /dev/raw/raw1. This is the
transaction log file for the database. This is early in initialization
of the database, so it's initializing the transaction log file.


2003-09-03 17:38:07

by Andrew Morton

Subject: Re: FYI: dbt testing on 2.6.0-test4-mm4 fails

Dave Olien <[email protected]> wrote:
>
> I'm just mailing you this to keep you informed, Daniel McNeil and
> I are investigating a failure of the dbt database workload test on
> 2.6.0-test4-mm4.

hmm, the direct-io code hasn't changed significantly since February(!).

Which filesystem are you using?

One possibility is that some lower-level error occurred in the filesystem or
the device driver but the error code was not correctly propagated back.
Could you sprinkle error-path printk's in the direct-io code?
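
To make the "error code not propagated" possibility concrete, here is a purely illustrative user-space sketch. It is not the kernel's direct-io code; every name in it (failing_submit, write_chunks_buggy, write_chunks_fixed) is hypothetical. It only shows how a chunked write loop can turn a lower-level error into a return value of 0.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical lower layer that fails every request with -EIO. */
static ssize_t failing_submit(const char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -EIO;
}

/*
 * Buggy pattern: on any failure the loop just stops and returns the
 * bytes completed so far. If the first chunk fails, that is 0, and the
 * caller sees "0 bytes written" with no error code.
 */
ssize_t write_chunks_buggy(const char *buf, size_t len, size_t chunk)
{
    size_t done = 0;

    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        ssize_t ret = failing_submit(buf + done, n);

        if (ret <= 0)
            break;              /* error code is dropped here */
        done += (size_t)ret;
    }
    return (ssize_t)done;
}

/*
 * Fixed pattern: if nothing was written, hand the lower-level error
 * back to the caller; otherwise report the partial count.
 */
ssize_t write_chunks_fixed(const char *buf, size_t len, size_t chunk)
{
    size_t done = 0;

    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        ssize_t ret = failing_submit(buf + done, n);

        if (ret < 0)
            return done ? (ssize_t)done : ret;
        if (ret == 0)
            break;
        done += (size_t)ret;
    }
    return (ssize_t)done;
}
```

With the always-failing lower layer, the buggy variant returns 0 and the fixed variant returns -EIO. Whether anything like this is happening in the real code path is exactly what error-path printk's would help confirm or rule out.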


2003-09-03 19:00:18

by Cliff White

Subject: Re: FYI: dbt testing on 2.6.0-test4-mm4 fails

> Dave Olien <[email protected]> wrote:
> >
> > I'm just mailing you this to keep you informed, Daniel McNeil and
> > I are investigating a failure of the dbt database workload test on
> > 2.6.0-test4-mm4.
>
> hmm, the direct-io code hasn't changed significantly since February(!).
>
> Which filesystem are you using?

dbt2-1tier uses raw for all the database devices. No filesystems are created
during the run.

dbt3-pgsl runs on filesystems, and has been running successfully on
recent kernels.

cliffw
>
> One possibility is that some lower-level error occurred in the filesystem or
> the device driver but the error code was not correctly propagated back.
> Could you sprinkle error-path printk's in the direct-io code?
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2003-09-03 21:14:07

by Dave Olien

Subject: Re: FYI: dbt testing on 2.6.0-test4-mm4 fails


Right now, we're looking over older test runs and tracking down exactly
which patch set seems to have caused a break. We THINK we MIGHT have
a difference in behavior between mm3 and mm3-1.

We'll follow this path for a bit and see where it gets us.

But following that, yes, we can put some printk's in and see where
that leads.

On Wed, Sep 03, 2003 at 10:20:42AM -0700, Andrew Morton wrote:
> Dave Olien <[email protected]> wrote:
> >
> > I'm just mailing you this to keep you informed, Daniel McNeil and
> > I are investigating a failure of the dbt database workload test on
> > 2.6.0-test4-mm4.
>
> hmm, the direct-io code hasn't changed significantly since February(!).
>
> Which filesystem are you using?
>
> One possibility is that some lower-level error occurred in the filesystem or
> the device driver but the error code was not correctly propagated back.
> Could you sprinkle error-path printk's in the direct-io code?
>
>

2003-09-12 21:40:08

by Daniel McNeil

Subject: Re: FYI: dbt testing on 2.6.0-test4-mm4 fails

Andrew,

I have isolated which patch is causing the dbt2 test failures.
The O_SYNC-speedup-nolock-fix.patch appears to be causing the
problem.

2.6.0-test5 passes the dbt2-tier1 STP test.
2.6.0-test5 + O_SYNC-speedup-2.patch (plm #2134) passes.
2.6.0-test5 + O_SYNC-speedup-2.patch + O_SYNC-speedup-nolock-fix.patch
(plm #2135) FAILS.

Dave Olien and I have added some debug output to try to catch write()
calls that return 0, but the debug output does not seem to match the
errors we are getting from the dbt2 test runs. We get kernel debug
output for zero-byte writes on regular files, not on the raw files that
the dbt2 test uses.

Any ideas on why/how this patch could be causing problems?

Thanks,

Daniel



On Wed, 2003-09-03 at 10:07, Dave Olien wrote:
> Andrew,
>
> I'm just mailing you this to keep you informed, Daniel McNeil and
> I are investigating a failure of the dbt database workload test on
> 2.6.0-test4-mm4.
>
> The failure MAY have begun as early as 2.6.0-test4. We were able
> to test on test4 only after I generated a patch to raw_open() for that
> kernel version. The database test4 failure LOOKS the same as the
> test4-mm4 failure. But we haven't investigated it as closely there yet.
> We know test3 worked OK. We may try some of the test3-mm patches to
> see if something happened on one of those patches.
>
> In the test4-mm4 case, the kernel doesn't oops or hang. Instead, the
> database software detects a failure of some sort. We ran strace on
> the database processes, and in one of them we see the following
> output:
>
> _llseek(38, 8192, [8192], SEEK_SET) = 0
> write(38, "\0\0\0\0\4\3\1\0\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 0
>
> A seek on file descriptor 38 to offset 8192, followed by a write of 8k.
> The write returns with 0 bytes written.
>
> Immediately after this, we can see this process writing to the error
> log a message indicating an error has been detected.
>
> File descriptor 38 is for the file /dev/raw/raw1. This is the
> transaction log file for the database. This is early in initialization
> of the database, so it's initializing the transaction log file.