From: Ofir Bitton
To: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Cc: Farah Kassabri
Subject: [PATCH 9/9] accel/habanalabs: add heartbeat debug info
Date: Mon, 27 May 2024 14:47:46 +0300
Message-Id: <20240527114746.1919292-9-obitton@habana.ai>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240527114746.1919292-1-obitton@habana.ai>
References: <20240527114746.1919292-1-obitton@habana.ai>

From: Farah Kassabri

It is hard to debug the reason for heartbeat check failures.
As an attempt to ease this task, this patch will provide more
information when this failure happens.
Heartbeat checks the communication with FW, so printing the CPU queue
pi/ci and the counter of how many times that event was received would
help in debugging the issue.

Signed-off-by: Farah Kassabri
Reviewed-by: Ofir Bitton
---
 drivers/accel/habanalabs/common/device.c     | 12 ++++++++++++
 drivers/accel/habanalabs/common/habanalabs.h | 15 ++++++++++++++-
 drivers/accel/habanalabs/gaudi2/gaudi2.c     |  3 +++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index bb3f44392908..35502e938b5d 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1052,12 +1052,22 @@ static bool is_pci_link_healthy(struct hl_device *hdev)
 static bool hl_device_eq_heartbeat_received(struct hl_device *hdev)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
+	u32 cpu_q_id;
 
 	if (!prop->cpucp_info.eq_health_check_supported)
 		return true;
 
 	if (!hdev->eq_heartbeat_received) {
+		cpu_q_id = hdev->heartbeat_debug_info.cpu_queue_id;
+
 		dev_err(hdev->dev, "EQ heartbeat event was not received!\n");
+
+		dev_err(hdev->dev, "Heartbeat events counter: %u, Q_PI: %u, Q_CI: %u, EQ CI: %u, EQ prev: %u\n",
+			hdev->heartbeat_debug_info.heartbeat_event_counter,
+			hdev->kernel_queues[cpu_q_id].pi,
+			atomic_read(&hdev->kernel_queues[cpu_q_id].ci),
+			hdev->event_queue.ci,
+			hdev->event_queue.prev_eqe_index);
 		return false;
 	}
 
@@ -1138,6 +1148,8 @@ static int device_late_init(struct hl_device *hdev)
 	hdev->high_pll = hdev->asic_prop.high_pll;
 
 	if (hdev->heartbeat) {
+		hdev->heartbeat_debug_info.heartbeat_event_counter = 0;
+
 		/*
 		 * Before scheduling the heartbeat driver will check if eq event has received.
 		 * for the first schedule we need to set the indication as true then for the next
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index 55495861f432..5e9f54ca336a 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -71,7 +71,7 @@ struct hl_fpriv;
 
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
-#define HL_HEARTBEAT_PER_USEC		5000000 /* 5 s */
+#define HL_HEARTBEAT_PER_USEC		10000000 /* 10 s */
 
 #define HL_PLL_LOW_JOB_FREQ_USEC	5000000 /* 5 s */
 
@@ -3174,6 +3174,16 @@ struct hl_reset_info {
 	u8 watchdog_active;
 };
 
+/**
+ * struct eq_heartbeat_debug_info - stores debug info to be used upon heartbeat failure.
+ * @heartbeat_event_counter: number of heartbeat events received.
+ * @cpu_queue_id: used to read the queue pi/ci
+ */
+struct eq_heartbeat_debug_info {
+	u32 heartbeat_event_counter;
+	u32 cpu_queue_id;
+};
+
 /**
  * struct hl_device - habanalabs device structure.
  * @pdev: pointer to PCI device, can be NULL in case of simulator device.
@@ -3262,6 +3272,7 @@ struct hl_reset_info {
  * @clk_throttling: holds information about current/previous clock throttling events
  * @captured_err_info: holds information about errors.
  * @reset_info: holds current device reset information.
+ * @heartbeat_debug_info: counters used to debug heartbeat failures.
  * @irq_affinity_mask: mask of available CPU cores for user and decoder interrupt handling.
  * @stream_master_qid_arr: pointer to array with QIDs of master streams.
  * @fw_inner_major_ver: the major of current loaded preboot inner version.
@@ -3452,6 +3463,8 @@ struct hl_device {
 
 	struct hl_reset_info		reset_info;
 
+	struct eq_heartbeat_debug_info	heartbeat_debug_info;
+
 	cpumask_t			irq_affinity_mask;
 
 	u32				*stream_master_qid_arr;
diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 962b7fcd4318..08276f03c80f 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -3796,6 +3796,8 @@ static int gaudi2_sw_init(struct hl_device *hdev)
 	if (rc)
 		goto special_blocks_free;
 
+	hdev->heartbeat_debug_info.cpu_queue_id = GAUDI2_QUEUE_ID_CPU_PQ;
+
 	return 0;
 
 special_blocks_free:
@@ -9777,6 +9779,7 @@ static u16 event_id_to_engine_id(struct hl_device *hdev, u16 event_type)
 
 static void hl_eq_heartbeat_event_handle(struct hl_device *hdev)
 {
+	hdev->heartbeat_debug_info.heartbeat_event_counter++;
 	hdev->eq_heartbeat_received = true;
 }
 
-- 
2.34.1
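
For readers without the driver tree at hand, the bookkeeping this patch adds can be modelled in plain userspace C. The sketch below is only an illustration under stated assumptions: struct model_queue, struct model_device and the model_* functions are hypothetical stand-ins, not driver symbols, and clearing the received flag after a passing check is an assumption of this model rather than something shown in the diff. It covers the event counter and the CPU queue pi/ci only; the EQ indices are left out.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one kernel queue; not the driver's structure. */
struct model_queue {
	uint32_t pi;	/* producer index, advanced on submission */
	uint32_t ci;	/* consumer index, advanced on consumption */
};

/* Hypothetical stand-in for the parts of the device this patch touches. */
struct model_device {
	struct model_queue cpu_q;
	uint32_t heartbeat_event_counter;	/* heartbeat events seen so far */
	bool eq_heartbeat_received;		/* set by the event handler */
};

/* Same two steps as the patched event handler: count the event, mark it seen. */
static void model_heartbeat_event(struct model_device *dev)
{
	assert(dev);
	dev->heartbeat_event_counter++;
	dev->eq_heartbeat_received = true;
}

/*
 * Periodic check. Returns true when an event arrived since the last check.
 * On failure, formats the counter and queue indices into buf, similar in
 * spirit to the second dev_err() added by the patch.
 */
static bool model_heartbeat_check(struct model_device *dev, char *buf, size_t len)
{
	assert(dev && buf && len > 0);

	if (!dev->eq_heartbeat_received) {
		snprintf(buf, len,
			 "Heartbeat events counter: %" PRIu32 ", Q_PI: %" PRIu32 ", Q_CI: %" PRIu32,
			 dev->heartbeat_event_counter, dev->cpu_q.pi, dev->cpu_q.ci);
		return false;
	}

	/* Model assumption: re-arm so the next period needs a fresh event. */
	dev->eq_heartbeat_received = false;
	buf[0] = '\0';
	return true;
}
```

In this model, a failure report showing a counter that stopped advancing, or a producer index ahead of the consumer index, would hint at where communication stalled. How those values should be interpreted on real hardware is up to the driver maintainers.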