md5deep - Verify files, missing startx, lightdm, ext4 missing storage space, df reports incorrect size and files are bigger after rsync transfer

To verify a lot of files between two identical disks I made an MD5 file using md5deep:

apt-get install md5deep
cd /where/the/data/is/
md5deep -r -e -l -of * > /where/to/put/verification-file.md5

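* = every non-hidden file and directory in the current directory (hidden entries at the top level are not matched by *)
-r = recursive
-e = show a progress estimate while working
-l = relative paths, which is why you first cd into the directory
-of = only process regular files, which skips symbolic links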

Note that md5deep can take a lot of time to finish.
For 1 TB of data with 468000 files md5deep took about 12 hours.
Watch disk temperatures while doing this as the disk gets a real workout.
(smartctl -A /dev/sdX | grep Temperature_Celsius)

When finished you may want to transfer the verification-file.md5 to another system, but before doing that remember to do this:

md5sum verification-file.md5 

Note the checksum, for instance in the filename, and when you have transferred the file to the new system check it again with md5sum. This is because the verification file itself can be corrupted during the transfer.
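A minimal sketch of the checksum-in-the-filename idea, assuming GNU coreutils (md5sum, cut, cp). The function names tag_with_md5 and check_tagged are made up for this note and are not part of md5deep or coreutils.

```shell
# Hypothetical helpers, for illustration only.

# Copy FILE to FILE.<md5> so the checksum travels inside the name.
# Prints the name of the tagged copy.
tag_with_md5() {
    sum=$(md5sum < "$1" | cut -d ' ' -f 1)
    cp -- "$1" "$1.$sum"
    printf '%s\n' "$1.$sum"
}

# Succeed only if the checksum at the end of the name matches the content.
check_tagged() {
    expected=${1##*.}
    actual=$(md5sum < "$1" | cut -d ' ' -f 1)
    [ "$expected" = "$actual" ]
}
```

Run tag_with_md5 verification-file.md5 before the transfer, copy the tagged file over, and run check_tagged on it on the other system. A non-zero exit status means the file changed on the way.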

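Before diff-testing the files you may want to sort them, as md5deep does not seem to list them in alphabetical order by default. A plain sort orders the lines by checksum, which is fine as long as both files are sorted the same way: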

sort file.md5 > file-sorted.md5

Then you may diff the sorted files:
diff file1-sorted.md5 file2-sorted.md5

The output should be empty if the two files are identical.
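The sort-then-diff steps can be wrapped up as below. This is only a sketch, assuming each line has the form "checksum  path" and that GNU sort and diff are available. The name compare_manifests is made up for this note. It sorts on the path column instead of the checksum, so any differences come out in path order.

```shell
# Hypothetical helper: compare two "checksum  path" lists regardless of line order.
# Exit status is that of diff: 0 = identical, 1 = differences found.
compare_manifests() {
    a=$(mktemp) && b=$(mktemp) || return 2
    sort -k 2 "$1" > "$a"    # sort on everything from the path column onwards
    sort -k 2 "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Example: compare_manifests file1.md5 file2.md5 && echo "disks match"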

startx is missing from Debian 7.8 by default after installing MATE via command line

apt-get install xinit

To start in text mode after installing MATE-desktop and lightdm you need to edit:

After creating an ext4 partition you may find that some storage space is missing, especially if you create a separate storage partition and compare it to, for example, a ReiserFS partition.

By default 5% of the ext4 disk space is reserved for root and system. This is not necessary on a device used only for storage (do not change this for the root partition) and the reservation can be removed live:

tune2fs -m 0 /dev/sdX

This will free a lot of bytes if the disk is big. But the partition will still not have the same available space as a ReiserFS partition.

After this, one partition went up from 848GB to 911GB. Compared to ReiserFS, which showed 925GB, about 14GB was still missing.

To get more you may need to recreate the filesystem on the partition - beware, this erases all data on it - like this:

mkfs.ext4 -m 0 -O sparse_super -T largefile4 /dev/sdX

-m 0 reserves no space for root and system
-O sparse_super, "Create a filesystem with fewer superblock backup copies (saves space on large filesystems)."
(this should already be enabled by default in recent versions of mke2fs)
-T largefile4, "Specify how the filesystem is going to be used, so that mke2fs can choose optimal filesystem parameters for that use. [...] largefile4 one inode per 4 megabytes"

After recreating the partition like this the partition went up to 925GB, nearly equal to the ReiserFS partition in available space.

BUT beware, formatting like this leaves the filesystem with a very low number of inodes. You can check inode usage with df -i. If you run out of inodes you cannot create any more files, even if there is free space left.
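A rough back-of-the-envelope check of the inode count, as a sketch only. It assumes the "925GB" above means GiB, and that the number of inodes is roughly the filesystem size divided by the bytes-per-inode ratio. The real count that mke2fs picks will differ somewhat. The helper name estimate_inodes is made up, and 16384 is the usual default ratio in mke2fs.conf, which may differ on your system.

```shell
# Hypothetical helper: filesystem size in bytes / bytes-per-inode ratio.
estimate_inodes() {
    echo $(( $1 / $2 ))
}

size=$(( 925 * 1024 * 1024 * 1024 ))   # assumes "925GB" means 925 GiB

estimate_inodes "$size" 4194304        # largefile4: one inode per 4 MiB -> 236800
estimate_inodes "$size" 16384          # assumed usual default ratio     -> 60620800
```

By that estimate the 468000 files mentioned at the top of this note would not fit on a largefile4 filesystem of this size, so compare df -i on the source disk with the estimate before reformatting.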

apt-get install reiserfsprogs ...

Related to the above problem are sparse files. Linux and virtual machines are clever when handling big empty files: the system stores them as sparse files, big containers of zeros whose empty regions take up no space on disk (it is not exactly compression, but thinking of it that way makes it easier to grasp). So you may have a 5 GB file that only occupies 5 MB on disk, and it grows towards the full 5 GB as real data is written into it.

Different programs handle sparse files differently, and this is the problem.

df seems to go by file system metadata, some kind of total, not by the actual sizes of the files on the disk. It does not seem to tell the whole story.

du by default does not count sparse files (big files that are empty, filled with zeros) as big files and only reports the "compressed" size, the space actually used on disk. It can be told to report the "uncompressed" size by adding --apparent-size

ls reports the "uncompressed" sizes. With ls -sl the first column displays the size actually allocated in blocks, which is useful for detecting sparse files
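A small demo of the difference between the two sizes, assuming GNU coreutils (truncate, stat, du) and a filesystem that supports sparse files. The helper names apparent_bytes and allocated_kib are made up for this note.

```shell
# Hypothetical helpers, for illustration only.
apparent_bytes() { stat -c %s "$1"; }            # the size ls -l shows
allocated_kib()  { du -k "$1" | cut -f 1; }      # the space really used on disk

demo=$(mktemp -d)
truncate -s 5M "$demo/sparse.img"                # 5 MiB file that is one big hole

echo "apparent:  $(apparent_bytes "$demo/sparse.img") bytes"
echo "allocated: $(allocated_kib "$demo/sparse.img") KiB"
ls -ls "$demo/sparse.img"                        # first column = allocated blocks
du -k --apparent-size "$demo/sparse.img"         # du told to report the full size
rm -rf "$demo"
```

A file whose allocated size is much smaller than its apparent size is sparse.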

rsync does not detect sparse files by default, but can do so with --sparse. Note that --sparse cannot be used together with --inplace, so first delete the target data and redo the sync with --sparse; later runs can use --inplace, as that grows the sparse files if needed. -u does not seem to work with --sparse either. A bug report for these problems in rsync has been filed and a patch has been created, but the work has been stalled since 2013-09.

Because rsync does not detect sparse files by default, a straightforward operation like this can become very confusing:

rsync -av /old/ /new/

If the /old/ folder contains one or more sparse files, they may grow when copied to the /new/ folder.

Running rsync --sparse on a sparse file that has been "uncompressed" does seem to "recompress" it. This is quite interesting. Doing rsync -av --sparse /old/ /new/ may cause the opposite of the above problem: the target may get smaller than the source if there are a lot of files that could be made sparse.

cp can handle sparse files but it seems to differ between versions.
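With GNU cp the behaviour can be requested explicitly through --sparse=always, which turns runs of zeros into holes in the copy. A minimal sketch, assuming GNU coreutils; the wrapper name sparse_copy is made up for this note, and how much space is saved depends on the filesystem.

```shell
# Hypothetical wrapper around GNU cp, for illustration only.
sparse_copy() {
    cp --sparse=always -- "$1" "$2"
}

demo=$(mktemp -d)
# 4 MiB of real zeros written out, so this file is not sparse to begin with.
dd if=/dev/zero of="$demo/full.img" bs=1M count=4 2>/dev/null
sparse_copy "$demo/full.img" "$demo/holes.img"

du -k "$demo/full.img" "$demo/holes.img"                   # allocated sizes
du -k --apparent-size "$demo/full.img" "$demo/holes.img"   # apparent sizes
rm -rf "$demo"
```

The two files have the same content and the same apparent size, but the copy should use less space on disk.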

fallocate has a switch named -d / --dig-holes, available from util-linux 2.25, that should be able to "recompress" a file if it works. But version 2.25 is quite "new", it first appeared in Debian Jessie 8. Version 2.20 does not have this switch.

This is a personal note. Last updated: 2015-06-25 12:29:20.






