Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758535AbcCCVyJ (ORCPT ); Thu, 3 Mar 2016 16:54:09 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:62782 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758517AbcCCVyG (ORCPT ); Thu, 3 Mar 2016 16:54:06 -0500 Subject: Re: [PATCH] mce-apei: do not rely on ACPI_ERST_GET_RECORD_ID for record id To: "Luck, Tony" References: <1457031814-17435-1-git-send-email-jbacik@fb.com> <20160303214945.GA11233@intel.com> CC: , , , , , , From: Josef Bacik Message-ID: <56D8B24C.9080003@fb.com> Date: Thu, 3 Mar 2016 16:53:16 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <20160303214945.GA11233@intel.com> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [192.168.54.13] X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-03-03_16:,, signatures=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2260 Lines: 66 On 03/03/2016 04:49 PM, Luck, Tony wrote: >> retry: >> - rc = erst_get_record_id_next(&pos, record_id); >> - if (rc) >> - goto out; >> + /* >> + * Some hardware is broken and doesn't actually advance the record id > > I'd blame this on firmware rather than hardware. Yup sorry misspoke. > >> + * returned by ACPI_ERST_GET_RECORD_ID when we read a record like the >> + * spec says it is supposed to. So instead use record_id == 0 to just >> + * grab the first record in the erst, and fall back only if we trip over >> + * a record that isn't a MCE record. >> + */ >> + if (lookup_record) { >> + rc = erst_get_record_id_next(&pos, record_id); >> + if (rc) >> + goto out; >> + } else { >> + *record_id = 0; >> + } >> /* no more record */ >> if (*record_id == APEI_ERST_INVALID_RECORD_ID) >> goto out; >> rc = erst_read(*record_id, &rcd.hdr, sizeof(rcd)); >> - /* someone else has cleared the record, try next one */ >> - if (rc == -ENOENT) >> - goto retry; >> - else if (rc < 0) >> + /* >> + * someone else has cleared the record, try next one if we are looking >> + * up records. If we aren't looking up the record id's then just bail >> + * since this means we have an empty table. >> + */ >> + if (rc == -ENOENT) { >> + if (lookup_record) >> + goto retry; >> + rc = 0; >> + goto out; >> + } else if (rc < 0) { >> goto out; >> - /* try to skip other type records in storage */ >> - else if (rc != sizeof(rcd) || >> - uuid_le_cmp(rcd.hdr.creator_id, CPER_CREATOR_MCE)) >> + } else if (rc != sizeof(rcd) || >> + uuid_le_cmp(rcd.hdr.creator_id, CPER_CREATOR_MCE)) { >> + /* try to skip other type records in storage */ >> + lookup_record = true; >> goto retry; > > Are you still doomed by the buggy firmware if we take this "goto"? > You be back at the top of the loop excpecting erst_get_record_id_next() > to move on. Does this just not happen in practice (finding non MCE > records in amognst the MCE ones)? > So one of the boxes had a non MCE record with MCE records, but yeah it was on a box that was also broken in this way. I'm not super worried about this case for us, I just want to have a fallback for any firmware that may not be broken in this way to be able to skip things. Thanks, Josef