Monday, May 21, 2018

Repairing Corrupted LUKS Encrypted Filesystem

This morning I noticed that my USB backup drive would not mount properly.  The first thing I did was check the kernel log with `dmesg` on my Debian Linux box...

[1962217.824829] usb 1-5.1: New USB device found, idVendor=0bc2, idProduct=3300
[1962217.824832] usb 1-5.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[1962217.824834] usb 1-5.1: Product: Desktop         
[1962217.824835] usb 1-5.1: Manufacturer: Seagate 
[1962217.824837] usb 1-5.1: SerialNumber: 2GHNRSVE    
[1962217.825589] usb-storage 1-5.1:1.0: USB Mass Storage device detected
[1962217.825927] scsi host4: usb-storage 1-5.1:1.0
[1962218.832372] scsi 4:0:0:0: Direct-Access     Seagate  Desktop          0130 PQ: 0 ANSI: 4
[1962218.833253] sd 4:0:0:0: Attached scsi generic sg1 type 0
[1962218.833370] sd 4:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[1962218.833691] sd 4:0:0:0: [sdb] Write Protect is off
[1962218.833694] sd 4:0:0:0: [sdb] Mode Sense: 2f 08 00 00
[1962218.833991] sd 4:0:0:0: [sdb] No Caching mode page found
[1962218.833996] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[1962218.842262]  sdb: sdb1
[1962218.843668] sd 4:0:0:0: [sdb] Attached SCSI disk
[1962219.029068] sd 4:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[1962219.029072] sd 4:0:0:0: [sdb] tag#0 Sense Key : Hardware Error [current] [descriptor] 
[1962219.029074] sd 4:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
[1962219.029077] sd 4:0:0:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
[1962219.111836] sd 4:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[1962219.111839] sd 4:0:0:0: [sdb] tag#0 Sense Key : Hardware Error [current] [descriptor] 
[1962219.111841] sd 4:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
[1962219.111844] sd 4:0:0:0: [sdb] tag#0 CDB: ATA command pass through(12)/Blank a1 06 20 da 00 00 4f c2 00 b0 00 00
[1962251.479639] JBD2: Invalid checksum recovering block 144 in log
[1962251.574260] JBD2: recovery failed
[1962251.574265] EXT4-fs (dm-4): error loading journal
[1962258.650930] JBD2: Invalid checksum recovering block 144 in log
[1962258.749283] JBD2: recovery failed
[1962258.749288] EXT4-fs (dm-4): error loading journal

This generally means the filesystem was not unmounted cleanly.  When ext4 tried to replay its journal, the checksum stored for journal block 144 did not match the checksum computed from the data actually read from disk.  Because the journal cannot be trusted, ext4 refuses to mount the filesystem rather than risk further corruption.  (dm-4 in the log is a device-mapper device, here the unlocked LUKS volume.)
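
To pull just the journal-related lines out of a long kernel log, a small filter along these lines can help.  This is only a sketch: the function name `journal_errors` is my own, and the pattern is simply matched to the messages shown above, so other kernels may word things differently.

```shell
# Keep only ext4/JBD2 journal messages from kernel log text on stdin.
# The pattern is an assumption based on the messages in this post.
journal_errors() {
    grep -E 'JBD2:|EXT4-fs .*error'
}

# Typical use (reading the kernel log usually needs root):
#   dmesg | journal_errors
```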

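So how do we fix this?  If the drive were not encrypted, you could google the error, learn that you need to run fsck, and hopefully be done with it.  With LUKS in the way there is one extra step: fsck has to run against the decrypted mapping, not the raw partition.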

Let's start with a quick reminder of the utility `lsblk`.

root@CLCFQ92:~# lsblk --fs /dev/sdb
NAME                                          FSTYPE      LABEL  UUID   
sdb                               
└─sdb1                                        crypto_LUKS        ce6cebbc-5026-4f47-9a22-da4aecfd26ad 
  └─luks-ce6cebbc-5026-4f47-9a22-da4aecfd26ad ext4        Backup e7bab3dc-87ce-4a7d-b758-34e2839b51f0 

The console output above may not render well with the blog defaults, but the thing to notice is that the filesystem type of sdb1 is NOT ext4; it is crypto_LUKS.  Beneath it sits a device-mapper mapping (not a real partition) named luks-<UUID>, and that mapping is the one whose filesystem type is ext4.  This decrypted mapping, which lives under /dev/mapper, is what you want to run fsck on.  NOT /dev/sdb1 (or /dev/sdb).
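
If you would rather not pick the mapping out of the tree by eye, `lsblk` can also print a plain NAME/TYPE list, where dm-crypt mappings show up with type `crypt`.  A rough sketch follows; the helper name `crypt_mappings` is made up for this post.

```shell
# Read `lsblk -rn -o NAME,TYPE` output on stdin and print the
# /dev/mapper path of every dm-crypt mapping (TYPE column "crypt").
crypt_mappings() {
    awk '$2 == "crypt" { print "/dev/mapper/" $1 }'
}

# Typical use against the real disk:
#   lsblk -rn -o NAME,TYPE /dev/sdb | crypt_mappings
```

The printed path is the device to hand to fsck.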

So let's first open the encrypted fs...

root@CLCFQ92:~# cryptsetup luksOpen /dev/sdb1 corrupted
Enter passphrase for /dev/sdb1: 

Now, let's repair the filesystem via the name ("corrupted") we mapped to the LUKS partition...

root@CLCFQ92:~# fsck /dev/mapper/corrupted 
fsck from util-linux 2.29.2
e2fsck 1.43.4 (31-Jan-2017)
Backup: recovering journal
JBD2: Invalid checksum recovering block 144 in log
Journal checksum error found in Backup
Backup was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (436669919, counted=433657941).
Fix? yes
Free inodes count wrong (121861163, counted=121856397).
Fix? yes

Backup: ***** FILE SYSTEM WAS MODIFIED *****
Backup: 245363/122101760 files (0.4% non-contiguous), 54719555/488377496 blocks
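
If you script this, note that fsck reports its result through an exit status that is a bitmask (documented in fsck(8)).  A run like the one above, where fixes were applied, would normally exit with status 1 rather than 0.  Here is a small decoder as a sketch; the function name and the wording of the messages are mine, only the bit values come from the man page.

```shell
# Decode an fsck exit status (bitmask per fsck(8)) into readable text.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in "1:errors corrected" "2:reboot recommended" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage or syntax error" "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}      # number before the colon
        text=${pair#*:}      # description after the colon
        if [ $(( status & bit )) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "$out"
}

# Typical use right after a check:
#   fsck /dev/mapper/corrupted; describe_fsck_status $?
```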

Next, close the mapped name for the LUKS partition:

root@CLCFQ92:~# cryptsetup luksClose /dev/mapper/corrupted

Then mount/open the filesystem the way you normally would, using GNOME or whatever.  Check the dmesg log for confirmation!
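
The open / fsck / close steps above can be sketched as one small function.  Treat it as an outline under assumptions, not a finished tool: `run` and `repair_luks_ext4` are names I made up, the partition and mapping name are whatever you pass in, and by default it only prints the commands (set DRY_RUN=0 to really run them, as root).

```shell
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repair_luks_ext4 PARTITION NAME
#   PARTITION: the LUKS partition, e.g. /dev/sdb1
#   NAME:      temporary mapping name, e.g. corrupted
repair_luks_ext4() {
    part=$1
    name=$2
    # Unlock the LUKS partition as /dev/mapper/NAME.
    run cryptsetup luksOpen "$part" "$name" || return 1
    # Check the decrypted ext4 filesystem, never the raw partition.
    run fsck "/dev/mapper/$name"
    status=$?
    # Always close the mapping again, even if fsck reported problems.
    run cryptsetup luksClose "$name"
    return "$status"
}

# Dry run (prints the three commands):
#   repair_luks_ext4 /dev/sdb1 corrupted
```

The function returns fsck's own exit status, so a non-zero result does not necessarily mean failure; status 1 means errors were found and corrected.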

At this point it may be good to look at using the `smartctl` utility, as your disk may be old/dying...

root@CLCFQ92:~# smartctl -a /dev/sdb
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   100   006    Pre-fail  Always       -       200743680
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   063   030    Pre-fail  Always       -       24115411
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20273
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       39
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   043   045    Old_age   Always   In_the_past 47 (Min/Max 42/47 #179)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       767
194 Temperature_Celsius     0x0022   047   057   000    Old_age   Always       -       47 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   022   021   000    Old_age   Always       -       200743680
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12697 (213 75 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3453343066
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2807657092

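Based on the SMART output, I would not trust this disk.  The current normalized values are all still above their thresholds and there are no reallocated or pending sectors, but the airflow temperature has dropped below its threshold in the past (WHEN_FAILED shows In_the_past), and the raw Raw_Read_Error_Rate, Seek_Error_Rate and Hardware_ECC_Recovered counters are large.  Seagate drives are known to pack those raw values, so they are not simple error counts; still, combined with the hardware errors in dmesg and the journal corruption, this disk should be backed up and disposed of.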
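
Reading that table by eye is tedious, so here is a sketch of a filter that flags the rows worth a second look.  The rule is my own simplification, not an official health test: it reports any attribute whose WHEN_FAILED column is not "-" or whose WORST value has reached a non-zero THRESH.  The function name `smart_flags` is made up, and the column positions are assumed to match the smartctl table shown above.

```shell
# Read a smartctl attribute table on stdin and print attributes that
# have failed now or in the past, or whose WORST value reached THRESH.
# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
smart_flags() {
    awk '$1 ~ /^[0-9]+$/ && ($9 != "-" || ($6 + 0 > 0 && $5 + 0 <= $6 + 0)) {
        printf "%s value=%d worst=%d thresh=%d when_failed=%s\n",
               $2, $4 + 0, $5 + 0, $6 + 0, $9
    }'
}

# Typical use (as root; -A prints only the attribute table):
#   smartctl -A /dev/sdb | smart_flags
```

Fed the table above, the only row it reports is Airflow_Temperature_Cel.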