Channel: Raspberry Pi Forums

Troubleshooting • Re: headless pi5 not responding every 4-10 days, journalctl entries simply end

well, the mmc errors seem to have gone away after switching to the new microsd card, but unfortunately what looks like the same crash issue is continuing.

after 4 days the machine locked up. i rebooted, and 4 days later it locked up again. both times the case felt quite warm to the touch.

logs before the first crash (found locked up Oct 07 at 17:29):

Code:

Oct 07 11:56:45 hostname kernel: usb 4-1: reset SuperSpeed USB device number 2 using xhci-hcd
Oct 07 11:56:47 hostname kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
Oct 07 11:56:47 hostname kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
Oct 07 11:56:45 hostname udisksd[715]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/ST5000DM000_XXXXXX_XXXXXXXX: Error updating SMART data: Error sending ATA command CHECK POWER MODE: Unexpected sense data returned:
                                  0000: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00    ................
                                  0010: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00    ................
                                   (g-io-error-quark, 0)
Oct 07 12:44:54 hostname systemd[1]: Starting apt-daily.service - Daily apt download activities...
Oct 07 12:58:25 hostname rtkit-daemon[1128]: The canary thread is apparently starving. Taking action.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Demoting known real-time threads.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1145 of process 1109.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1127 of process 1109.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1153 of process 1103.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1126 of process 1103.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1142 of process 1107.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Successfully demoted thread 1125 of process 1107.
Oct 07 12:58:25 hostname rtkit-daemon[1128]: Demoted 6 threads.
Oct 07 12:59:42 hostname rtkit-daemon[1128]: The canary thread is apparently starving. Taking action.
Oct 07 12:59:42 hostname rtkit-daemon[1128]: Demoting known real-time threads.
Oct 07 12:59:42 hostname rtkit-daemon[1128]: Successfully demoted thread 1145 of process 1109.
Oct 07 12:59:42 hostname rtkit-daemon[1128]: Successfully demoted thread 1127 of process 1109.
Oct 07 12:59:42 hostname rtkit-daemon[1128]: Successfully demoted thread 1153 of process 1103.
-- Boot f648e8cc79744e048358dc9b8f894e6a --

nothing obvious in the logs before the second crash (found locked up Oct 11 at 12:30 PM):

Code:

Oct 11 12:26:44 hostname CRON[15114]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 11 12:26:44 hostname CRON[15160]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 11 12:26:44 hostname CRON[15225]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 11 12:26:44 hostname CRON[15226]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 11 12:26:44 hostname CRON[15216]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Oct 11 12:26:44 hostname CRON[15227]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Oct 11 12:26:44 hostname CRON[15160]: pam_unix(cron:session): session closed for user root
Oct 11 12:26:44 hostname CRON[15114]: pam_unix(cron:session): session closed for user root
Oct 11 12:26:44 hostname CRON[15216]: pam_unix(cron:session): session closed for user root
Oct 11 12:26:44 hostname systemd[1]: Starting man-db.service - Daily man-db regeneration...

looking back for anything strange in the netconsole logs, i see consistent xhci-hcd errors, just not near the second crash time.

Code:

[130488.825492] usb 4-1: reset SuperSpeed USB device number 2 using xhci-hcd
[130488.852881] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[130488.862590] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 27 1a cc a0 00 00 20 00
[130488.870287] I/O error, dev sda, sector 5248541952 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 2
[130915.310414] usb 4-1: reset SuperSpeed USB device number 2 using xhci-hcd
[130915.337668] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[130915.347421] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 3f ac 95 04 00 00 20 00
[130915.355128] I/O error, dev sda, sector 8546199584 op 0x0:(READ) flags 0x80700 phys_seg 8 prio class 2
[156827.366096] usb 4-1: reset SuperSpeed USB device number 2 using xhci-hcd
[156827.393485] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[156827.403193] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 3e 51 e5 a0 00 00 20 00
[156827.410890] I/O error, dev sda, sector 8364436736 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
[358216.045348] usb 4-1: reset SuperSpeed USB device number 2 using xhci-hcd
[358216.072604] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[358216.082338] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
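
for anyone reading the CDB lines above, here is a minimal python sketch that decodes them (helper names are my own, not from any tool). it assumes the standard READ(10) layout (big-endian LBA in bytes 2-5, block count in bytes 7-8) and the ATA PASS-THROUGH(16) layout (ATA command in byte 14). the decoded LBA times 8 matches the sector in each I/O error line, which would fit the enclosure exposing 4096-byte blocks while the kernel counts 512-byte sectors. that factor of 8 is my inference from these logs, not something a tool reported.

```python
import re

# Opcodes that appear in the netconsole log.
READ_10 = 0x28
ATA_PASS_THROUGH_16 = 0x85

# A few Linux SCSI host byte codes; 0x07 is the one seen in the log.
HOSTBYTE_NAMES = {0x00: "DID_OK", 0x03: "DID_TIME_OUT", 0x07: "DID_ERROR"}


def parse_cdb(line: str) -> list[int]:
    """Extract the CDB bytes from a kernel 'CDB: opcode=0x.. <hex bytes>' line."""
    match = re.search(r"CDB: opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)", line)
    if match is None:
        raise ValueError("no CDB found in line")
    return [int(byte, 16) for byte in match.group(1).split()]


def decode_read10(cdb: list[int]) -> tuple[int, int]:
    """Return (lba, block_count) for a 10-byte READ(10) CDB."""
    if len(cdb) != 10 or cdb[0] != READ_10:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, blocks


def ata_command(cdb: list[int]) -> int:
    """Return the ATA command byte of a 16-byte ATA PASS-THROUGH(16) CDB."""
    if len(cdb) != 16 or cdb[0] != ATA_PASS_THROUGH_16:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return cdb[14]


if __name__ == "__main__":
    read_line = "sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 27 1a cc a0 00 00 20 00"
    lba, blocks = decode_read10(parse_cdb(read_line))
    # Assumed ratio: 4096-byte device blocks vs 512-byte kernel sectors.
    print(lba, blocks, lba * 8)
```

the byte-14 value in the opcode=0x85 lines is 0xe5, which is CHECK POWER MODE, the same command udisksd complained about on Oct 07. hostbyte=0x07 is DID_ERROR in the kernel's host byte codes.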
would usb-storage.quirks be relevant in this situation? (source: viewtopic.php?t=245931). i did some searching and there appear to be a number of threads about xhci issues with the pi5. some of them seem to have been fixed by a kernel update about a year ago (https://github.com/raspberrypi/linux/issues/5753).
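
for reference, a small python sketch of how that parameter would be put together for this enclosure. the function name is hypothetical. the flag "u" (ignore the uas driver) is the one usually suggested in those threads and is an assumption here, not something i have confirmed helps. the resulting string would go on the single line of the kernel command line file, which i am assuming is /boot/firmware/cmdline.txt on current Raspberry Pi OS.

```python
import re


def quirks_param(lsusb_line: str, flags: str = "u") -> str:
    """Build a usb-storage.quirks=VID:PID:FLAGS string from an lsusb ID line.

    The default flag "u" is an assumption taken from forum advice.
    """
    match = re.search(r"ID ([0-9a-f]{4}):([0-9a-f]{4})", lsusb_line)
    if match is None:
        raise ValueError("no 'ID vvvv:pppp' field found")
    vendor_id, product_id = match.groups()
    return f"usb-storage.quirks={vendor_id}:{product_id}:{flags}"


if __name__ == "__main__":
    line = "Bus 004 Device 002: ID 04e8:6126 Samsung Electronics Co., Ltd D3 Station"
    print(quirks_param(line))
```

one caveat: the lsusb output for this drive lists only a Bulk-Only interface (bInterfaceProtocol 80), so the uas driver is presumably not in use and the "u" flag may well change nothing.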

further drive details: # lsusb -v

Code:

lsusb -v
Bus 004 Device 002: ID 04e8:6126 Samsung Electronics Co., Ltd D3 Station
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               3.00
  bDeviceClass            0
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0         9
  idVendor           0x04e8 Samsung Electronics Co., Ltd
  idProduct          0x6126
  bcdDevice            2.04
  iManufacturer           1 Samsung
  iProduct                2 D3 Station
  iSerial                 3 000000000XXXXXXX
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x002c
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower                8mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x83  EP 3 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x0a  EP 10 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0400  1x 1024 bytes
        bInterval               0
        bMaxBurst               7
Binary Object Store Descriptor:
  bLength                 5
  bDescriptorType        15
  wTotalLength       0x0016
  bNumDeviceCaps          2
  USB 2.0 Extension Device Capability:
    bLength                 7
    bDescriptorType        16
    bDevCapabilityType      2
    bmAttributes   0x00000002
      HIRD Link Power Management (LPM) Supported
  SuperSpeed USB Device Capability:
    bLength                10
    bDescriptorType        16
    bDevCapabilityType      3
    bmAttributes         0x00
    wSpeedsSupported   0x000e
      Device can operate at Full Speed (12Mbps)
      Device can operate at High Speed (480Mbps)
      Device can operate at SuperSpeed (5Gbps)
    bFunctionalitySupport   1
      Lowest fully-functional device speed is Full Speed (12Mbps)
    bU1DevExitLat          10 micro seconds
    bU2DevExitLat         512 micro seconds
Device Status:     0x000d
  Self Powered
  U1 Enabled
  U2 Enabled

# smartctl -d sat -a /dev/sda1

Code:

smartctl 7.3 2022-02-28 r5338 [aarch64-linux-6.6.51+rpt-rpi-2712] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST5000DM000-1FK178
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000c50 082bdf96a
Firmware Version: CC48
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Oct 11 13:49:14 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  106) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 635) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       33951272
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       829
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       432
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       98993242142
  9 Power_On_Hours          0x0032   043   043   000    Old_age   Always       -       50354
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       298
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2 2 2
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   052   040   045    Old_age   Always   In_the_past 48 (Min/Max 45/48 #344)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       15
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       99105
194 Temperature_Celsius     0x0022   048   060   000    Old_age   Always       -       48 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   111   100   000    Old_age   Always       -       33951272
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       17279h+22m+23.824s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       29962621432
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       298980426967

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     50354         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
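
to make the "marginal Attributes" hint in that output easier to check, here is a minimal python sketch (my own helper names) that parses one row of the attribute table and tests whether the normalized WORST value has ever reached THRESH. that comparison is how i understand smartctl arrives at "In_the_past", so treat it as an assumption rather than the tool's exact logic.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attr(row: str) -> SmartAttr:
    """Parse one row of the smartctl -A table.

    Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    parts = row.split(None, 9)
    if len(parts) != 10:
        raise ValueError("unexpected attribute row")
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        when_failed=parts[8],
        raw=parts[9].strip(),
    )


def crossed_threshold(attr: SmartAttr) -> bool:
    """True if WORST has ever been at or below a non-zero THRESH."""
    return attr.thresh > 0 and attr.worst <= attr.thresh


if __name__ == "__main__":
    row = ("190 Airflow_Temperature_Cel 0x0022   052   040   045    "
           "Old_age   Always   In_the_past 48 (Min/Max 45/48 #344)")
    attr = parse_attr(row)
    print(attr.name, crossed_threshold(attr))
```

with the rows above, only 190 Airflow_Temperature_Cel (worst 040 against thresh 045) trips this check. 5 Reallocated_Sector_Ct does not, even though its raw count is 432, because the check only looks at normalized values.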
any further ideas are greatly appreciated. thanks again.

Statistics: Posted by 0nobody0 — Fri Oct 11, 2024 6:53 pm


