Hi,
On Wed, 2004-01-14 at 21:38, Tsuchiya Yoshihiro wrote:
> I usually use two to seven boxes at a time, and I get about two problems
> out of them within about two nights.
I was able to reproduce with your script, too: even on ramfs. Curiouser
and curiouser.
Al Viro helped me dig further, and we nailed down the reason for the
failure. I got this failure:
"something wrong (diff) happened at ae:619-th trial"
corresponding to the script code:
mkdir $TARGETDIR/dirXC$i || echo "Error $? making $TARGETDIR/dirXC$i" > $RDIR/MD-ERR-$1 2>&1
cd $TARGETDIR/dirXC$i > $RDIR/CD-ERR-$1 2>&1
if [ -s $RDIR/CD-ERR-$1 ]
then
    echo "something wrong (cd) happened at $1:$i-th trial "
    (df .; df -i .) > $RDIR/DF-$1
    exit;
fi
tar zxf $MOZSRC >> $ERRORF
# echo "test dir for $TARGETDIR" >> $INOFILE-$1
ls -lid $SOURCE >> $INOFILE-$1
diff -rq $TARGETDIR/$SOURCE $TARGETDIR/dirXC$i/$SOURCE > $RESULTS/dirXC$i.result 2>&1
DIFFSIZE=`ls -l $RESULTS/dirXC$i.result | awk '{print $5}'`
if [ $DIFFSIZE != 0 ];
then
    echo "something wrong (diff) happened at $1:$i-th trial "
    (df .; df -i .) > $RDIR/DF-$1
    exit;
and psacct was able to trace the order in which stuff happened; lastcomm showed:
tee        guest  ??  5.55 secs Thu Jan 15 22:15
tar        guest  ??  0.08 secs Thu Jan 15 22:24
gzip       guest  ??  0.25 secs Thu Jan 15 22:24
xc-1.2     guest  ??  0.02 secs Thu Jan 15 22:15
xc-1.2   F guest  ??  6.37 secs Thu Jan 15 22:15
xc-1.2   F guest  ??  0.00 secs Thu Jan 15 22:24
df         guest  ??  0.00 secs Thu Jan 15 22:24
df         guest  ??  0.01 secs Thu Jan 15 22:24
xc-1.2   F guest  ??  0.01 secs Thu Jan 15 22:24
awk        guest  ??  0.00 secs Thu Jan 15 22:24
ls         guest  ??  0.01 secs Thu Jan 15 22:24
diff       guest  ??  0.01 secs Thu Jan 15 22:24
rm         guest  ??  0.02 secs Thu Jan 15 22:24
ls         guest  ??  0.00 secs Thu Jan 15 22:24
mkdir      guest  ??  0.01 secs Thu Jan 15 22:24
Reading from the bottom up: we get the "mkdir" and "ls -lid" of the new
directory, but not the tar; then the "rm &" of the previous iteration
completes; then there's the diff that failed, and the ls and awk from
that; then the two "df"s, then we exit.
And *then*, after all that, the tar/gunzip finishes. Remember, lastcomm
records the exit of each task, not its start.
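
To make the exit-versus-start point concrete, here is a minimal sketch
(not part of the test script; the log file and the "slow"/"fast" labels
are invented for illustration). The job that starts first is logged
last, because each entry is written only when its job exits:

```shell
#!/bin/sh
# Illustration only: log each job at exit time, the way process
# accounting does, and observe that start order is not preserved.
log=$(mktemp) || exit 1

( sleep 1; echo "slow" >> "$log" ) &   # starts first, exits last
( echo "fast" >> "$log" )              # starts second, exits first
wait                                   # let the background job finish

first=$(head -n 1 "$log")              # entry of the job that exited first
last=$(tail -n 1 "$log")               # entry of the job that exited last
cat "$log"
rm -f "$log"
```

lastcomm lists the most recent record first, which is why the trace
above has to be read from the bottom up to follow exits in time order.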
So the problem seems to be that the shell is continuing beyond the "tar
zxf" before that command has finished, which is why we see ENOENT on
"cd" to the dir or on the diff.
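
One way to check for this directly might look like the sketch below. It
is an assumption about how such a check could be written, not something
taken from the test script: the scratch directory, the "sentinel" file
and the messages are all hypothetical. It extracts a tiny archive in
the foreground and then tests at once for a file that can only exist
if tar had really finished:

```shell
#!/bin/sh
# Hypothetical check: does the shell wait for a foreground tar?
work=$(mktemp -d) || exit 1
mkdir "$work/src" "$work/dst"
echo data > "$work/src/sentinel"
tar czf "$work/a.tar.gz" -C "$work" src    # build a small test archive

tar zxf "$work/a.tar.gz" -C "$work/dst"    # foreground extraction
status=$?

# If the shell waited for tar, the sentinel must already be present.
if [ "$status" -eq 0 ] && [ -f "$work/dst/src/sentinel" ]; then
    result="waited"
else
    result="continued early or tar failed (status=$status)"
fi
echo "$result"
rm -rf "$work"
```

On a healthy system this should always report "waited"; a single run
proves little, so it would have to loop for a long time to catch an
intermittent failure like the one above.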
The trace above was on the last running thread of the test. All other
threads had completed, so there was no interleaving of test runs in the
psacct records.
Now, I can't tell from this whether it's a bash bug or an exit/signal
bug, but it doesn't look like a filesystem problem for now. I'm going
to try with a different shell to see if that helps.
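
A rough sketch of that comparison, under stated assumptions: the list
of candidate shells is a guess at what might be installed, and the
one-line check stands in for the real test script. Each available shell
runs an external command in the foreground and must see its result
immediately afterwards:

```shell
#!/bin/sh
# Hypothetical harness: run the same foreground-wait check under
# whichever of these shells happen to be installed.
dir=$(mktemp -d) || exit 1
tested=0
failed=0
for shell in sh bash dash ksh zsh; do
    command -v "$shell" >/dev/null 2>&1 || continue
    marker="$dir/marker.$shell"
    # touch is an external command, so the shell has to wait for it
    # before the -f test can succeed.
    if "$shell" -c 'touch "$1"; [ -f "$1" ]' check "$marker"; then
        echo "$shell: ok"
    else
        echo "$shell: continued before touch finished"
        failed=$((failed + 1))
    fi
    tested=$((tested + 1))
done
rm -rf "$dir"
echo "shells tested: $tested, failures: $failed"
```

If the failure shows up under one shell only, that points at the shell;
if it shows up under all of them, the exit/signal path becomes the more
likely suspect.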
Cheers,
Stephen