From: Dmitry Osipenko Subject: Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support) Date: Fri, 2 Mar 2018 19:39:25 +0300 Message-ID: <24166154-8b08-bab6-ac8e-618eeba30674@gmail.com> References: <1511962879-24262-1-git-send-email-adrian.hunter@intel.com> <1511962879-24262-7-git-send-email-adrian.hunter@intel.com> <829308a3-3bf6-c173-65fa-e2a0f45f7f61@intel.com> <68886f99-97f5-897a-f754-6f414741bd5a@gmail.com> <22580b82-0257-b156-9f0c-79afa34067e5@gmail.com> <8876217f-ede6-fc81-2e05-b4fc976b3235@intel.com> <6a1267b0-6242-fc9f-60ed-02bf34677b62@intel.com> <20180301160418.GA2490@thunk.org> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Cc: Adrian Hunter , Ulf Hansson , linux-mmc , linux-block , linux-kernel , Bough Chen , Alex Lemberg , Mateusz Nowak , Yuliy Izrailov , Jaehoon Chung , Dong Aisheng , Das Asutosh , Zhangfei Gao , Sahitya Tummala , Harjani Ritesh , Venu Byravarasu , Linus Walleij , Shawn Lin , Bartlomiej Zolnierkiewicz < To: Andreas Dilger , Theodore Ts'o Return-path: In-Reply-To: Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On 01.03.2018 23:20, Andreas Dilger wrote: > > On Mar 1, 2018, at 9:04 AM, Theodore Ts'o wrote: >> This doesn't seem to make sense; the PC is where we are currently >> executing, and LR is the "Link Register" where the flow of control >> will be returning after the current function returns, right? Well, >> dx_probe should *not* be returning to __wait_on_bit(). So this just >> seems.... weird. >> >> Ignoring the LR register, this stack trace looks sane... I can't see >> which pointer could be NULL and getting dereferenced, though. How >> easily can you reproduce the problem? Can you either (a) translate >> the PC into a line number, or better yet, if you can reproduce, add a >> series of BUG_ON's so we can see what's going on? Ted, thank you for the suggestion. I don't have a bug-reproducer, it happens only under some IO load and quite randomly. I've applied the BUG_ON()'s, but it may take some time to catch the bug again. >> + BUG_ON(frame); > > I think you mean: > BUG_ON(frame == NULL); > or > BUG_ON(!frame); > > >> memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0])); >> frame->bh = ext4_read_dirblock(dir, 0, INDEX); >> if (IS_ERR(frame->bh)) >> return (struct dx_frame *) frame->bh; >> >> + BUG_ON(frame->bh); >> + BUG_ON(frame->bh->b_data); > > Same here. > > BUG_ON(frame->bh == NULL); > BUG_ON(frame->bh->b_data == NULL); > > This is why I don't like implicit "is NULL" or "is non-zero" usage. Lustre > used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to > abandon that because of upstream code style. Well spotted, thanks Andreas. >> root = (struct dx_root *) frame->bh->b_data; >> if (root->info.hash_version != DX_HASH_TEA && >> root->info.hash_version != DX_HASH_HALF_MD4 && >> root->info.hash_version != DX_HASH_LEGACY) { >> >> These are "could never" happen scenarios from looking at the code, but >> that will help explain what is going on. >> >> If this is reliably only happening with mq, the only way I could see >> that if is something is returning an error when it previously wasn't. >> This isn't a problem we're seeing with any of our testing, though.