Ryzen 9 5950X error
Incident description
dmesg reported:
<0>[790577.354149] mce: [Hardware Error]: CPU 6: Machine Check Exception: 4 Bank 1: bc800800060c0859
<0>[790577.354155] mce: [Hardware Error]: TSC 98830989a3dde ADDR 103ec1fa40 MISC d01a000000000000 IPID 100b000000000
<0>[790577.354160] mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1755581242 SOCKET 0 APIC c microcode a201210
<0>[790577.354163] mce: [Hardware Error]: Machine check: Uncorrected unrecoverable error in kernel context
<0>[790577.354164] Kernel panic - not syncing: Fatal local machine check
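Locating the error
Finding the documentation
Use lscpu to look up the CPU model; the relevant fields are CPU family 25 and Model 33.
In hexadecimal, 25d = 19h and 33d = 21h, so search the AMD website for the AMD Family 19h Model 21h documentation: Processor Programming Reference (PPR) for AMD Family 19h Model 21h, Revision B0 Processors (PUB)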
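As a cross-check, the same family and model can be recovered from the CPUID signature in the dmesg line `PROCESSOR 2:a20f12`. The sketch below assumes the standard CPUID Fn0000_0001 EAX layout (stepping in bits 3:0, base model in 7:4, base family in 11:8, extended model in 19:16, extended family in 27:20); the function name `decode_cpuid_signature` is hypothetical and not taken from any tool.

```python
def decode_cpuid_signature(eax):
    """Return (family, model, stepping) from a CPUID Fn0000_0001 EAX value."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # Assumption: AMD rule, extended fields are added only when base family is 0xF.
    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return family, model, stepping


family, model, stepping = decode_cpuid_signature(0xA20F12)
print(f"family {family} = {family:X}h, model {model} = {model:X}h, stepping {stepping}")
```

For a20f12 this yields family 25 (19h) and model 33 (21h), which agrees with the lscpu values.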
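Reading the documentation
According to 3.2.3 [Mapping of Banks to Blocks], bank 1 in the error report is the IF block, i.e. the Instruction Fetch Unit. The meaning of each bit of the MCA::IF::MCA_STATUS_IF register is therefore given in 3.2.5.2 [IF], which lets us decode the status value bc800800060c0859.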
0x bc800800060c0859 =
0b 1011 1100 1000 0000 0000 1000 0000 0000 0000 0110 0000 1100 0000 1000 0101 1001
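The hex-to-binary expansion above is easy to get wrong by hand. A minimal Python sketch for reproducing it, grouped into nibbles (the variable names are illustrative only):

```python
status = 0xBC800800060C0859

# Zero-pad to 64 bits, then split into 4-bit groups for readability.
bits = f"{status:064b}"
grouped = " ".join(bits[i:i + 4] for i in range(0, 64, 4))
print("0b", grouped)
```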
All fields default to zero, so focus on the non-zero ones:
| Bit | Value | Field | Description |
|---|---|---|---|
| 63 | 1 | Val | A valid error has been detected. |
| 61 | 1 | UC | The error was not corrected by hardware. |
| 60 | 1 | En | MCA error reporting is enabled for this error. |
| 59 | 1 | MiscV | Valid thresholding in MCA::IF::MCA_MISC0_IF. |
| 58 | 1 | AddrV | MCA::IF::MCA_ADDR_IF contains address information associated with the error. |
| 55 | 1 | TCC | Hardware context of the process thread to which the error was reported may have been corrupted. |
| 43 | 1 | Poison | The error was the result of attempting to consume poisoned data. |
| 29:24 | 000110 | AddrLsb | Specifies the least significant valid bit of the address contained in MCA::IF::MCA_ADDR_IF[ErrorAddr]. |
| 21:16 | 001100 | ErrorCodeExt | Extended Error Code. This field is used to identify the error type for root cause analysis. This field indicates which bit position in MCA::IF::MCA_CTL_IF enables error reporting for the logged error. |
| 15:0 | 0x0859 | ErrorCode | Error code for this error. See 3.1.3.3 [Error Codes] for details on decoding this field. |
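From the above, the CPU hit an error that hardware could not correct (UC), and the context of the affected thread may have been corrupted (TCC), so the affected process has to be terminated.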
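The table can be reproduced mechanically. The following sketch extracts only the fields listed in the table, using the bit positions given there; the `FIELDS` list and the `extract` helper are illustrative names, and fields of MCA_STATUS not shown in the table are deliberately left out.

```python
STATUS = 0xBC800800060C0859

# (field name, high bit, low bit), copied from the table above.
FIELDS = [
    ("Val", 63, 63),
    ("UC", 61, 61),
    ("En", 60, 60),
    ("MiscV", 59, 59),
    ("AddrV", 58, 58),
    ("TCC", 55, 55),
    ("Poison", 43, 43),
    ("AddrLsb", 29, 24),
    ("ErrorCodeExt", 21, 16),
    ("ErrorCode", 15, 0),
]


def extract(value, hi, lo):
    """Return bits hi:lo (inclusive) of value as an integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


decoded = {name: extract(STATUS, hi, lo) for name, hi, lo in FIELDS}
for name, field_value in decoded.items():
    print(f"{name:<13} = {field_value:#x}")
```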
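Error address
AddrLsb = 6 means the reported address is valid down to bit 6, so the error can be localized to a single 64-byte cache line. The address itself is held in the MCA::IF::MCA_ADDR_IF register, which is presumably the value ADDR 103ec1fa40 on the second line of the dmesg output, so the cache line at that address is the one in error.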
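As a quick check of the arithmetic, this sketch masks off the bits below AddrLsb to get the range covered by the reported address. It assumes, as the text does, that the ADDR value in dmesg is the content of MCA_ADDR; the variable names are illustrative.

```python
addr = 0x103EC1FA40   # ADDR field from the dmesg output
addr_lsb = 6          # MCA_STATUS[AddrLsb]

granule = 1 << addr_lsb            # bytes covered by one reported address
base = addr & ~(granule - 1)       # clear the bits below AddrLsb
last = base + granule - 1

print(f"{granule}-byte block: {base:#x} .. {last:#x}")
```

The reported address is already 64-byte aligned, so the block spans 0x103ec1fa40 through 0x103ec1fa7f.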
ErrorCodeExt
ErrorCodeExt = 12; bit 12 of MCA::IF::MCA_CTL_IF is:
L2RespPoison. Read-write. Reset: 0. L2 Cache Response Poison Error. Error is the result of consuming poison data.
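However, 3.1.3.3 [Error Codes] says that MCA_STATUS[ErrorCodeExt] does not directly denote the error type itself; it only identifies which enable bit applies to the logged error. In other words, the error may not have originated in L2, but the L2 response did carry poisoned data.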
ErrorCode
Given ErrorCode 0000 1000 0101 1001, 3.1.3.3 [Error Codes] shows that the only matching pattern is the third entry of Table 26:
- Error Code: 0000 1XXT RRRR XXLL
- Error Code Type: Bus
- Description:
  - XX = Reserved
  - T = Timeout
  - RRRR = Memory Transaction Type
  - LL = Cache Level
Therefore T = 0, RRRR = 0101, and LL = 01: the reported cache level (LL) is L1, and the memory transaction type (RRRR) is Instruction Fetch.
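The pattern match can be sketched in a few lines. The function name `decode_bus_error` is hypothetical, and the two lookup tables are intentionally partial: they contain only the encodings stated in the text above, not the full tables from the PPR.

```python
def decode_bus_error(code):
    """Split a 16-bit ErrorCode matching 0000 1XXT RRRR XXLL into its fields.

    Returns None when bits 15:11 are not 00001, i.e. not a Bus error code.
    """
    code &= 0xFFFF
    if (code >> 11) != 0b00001:
        return None
    return {
        "T": (code >> 8) & 0x1,      # timeout
        "RRRR": (code >> 4) & 0xF,   # memory transaction type
        "LL": code & 0x3,            # cache level
    }


# Partial mappings: only the values discussed in this write-up.
RRRR_NAMES = {0b0101: "Instruction Fetch"}
LL_NAMES = {0b01: "L1"}

fields = decode_bus_error(0x0859)
print(fields)
print(RRRR_NAMES.get(fields["RRRR"], "unknown"), LL_NAMES.get(fields["LL"], "unknown"))
```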
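Conclusion
In summary, the likely cause is: while an Instruction Fetch was being performed on the cache line at 0x103ec1fa40, an L1 cache related bus transaction error occurred, and the data returned by L2 for that fetch was poisoned.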