Recursive file/directory change-detection

Disclaimer

This article explores a way to create an approximate “fingerprint” of a file tree. If all you want is to detect file changes, inotify/incron is a much more appropriate tool.

Version 2 (update)

Another, much faster method is to walk the filesystem with ls -lR. On a newly installed Debian virtual machine (on Xen), hashing the entire filesystem (the root directory) took approximately 1.7 seconds. So, here it is:

ls -lR "$D" | sha1sum | sed 's/[ -]//g'

This method is sensitive to file name, size and modification time (ls -l also lists permissions, owner and group); usually that would be enough, but if you need more control use…

Version 1

Detect when the contents of a file or directory ($D) change:

find "$D" | while IFS= read -r f; do stat -t "$f"; done | sha1sum | sed 's/[ -]//g'

This yields a hash of the current state of the file or directory which is extremely sensitive to even the most subtle changes (even a simple touch to any file/directory somewhere inside "$D" changes the generated hash).

Add -L to the find command to follow symbolic links.
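
File names that contain newlines can still trip up a read loop, and find lists entries in whatever order the file-system returns them. Here is a minimal sketch of a more defensive variant, assuming GNU find, sort, xargs and stat; the function name fingerprint is made up for illustration:

```shell
#!/bin/sh
# fingerprint DIR: print a SHA-1 hash of the metadata of everything under DIR.
# Hypothetical helper; assumes GNU find, sort, xargs, stat and sha1sum.
fingerprint() {
    find "$1" -print0 |        # NUL-terminated names survive spaces and newlines
        LC_ALL=C sort -z |     # fixed byte-wise order, whatever order find used
        xargs -0 stat -t |     # same terse stat output as the one-liner above
        sha1sum |
        sed 's/[ -]//g'
}
```

Calling fingerprint "$D" prints a single 40-character hash. Like the one-liner, it includes access times, so it is just as touchy.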

Pros

  • Very easy to implement and reliable; it doesn’t require any changes to the existing OS infrastructure.
  • It can be used by any number of detached applications independently.
  • Hashes can be saved for later reference; they essentially represent the “fingerprint” of the path at a given moment in time, and can be used as such.
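
Saving a hash and comparing it on the next run could look roughly like this. It is only a sketch: the function name and the idea of a state file are assumptions, and the state file has to live outside the monitored directory, otherwise writing it changes the listing.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if DIR changed since the
# fingerprint stored in STATE_FILE, then store the new fingerprint.
# Uses the ls -lR variant from this article.
changed_since_last_run() {
    dir=$1
    state_file=$2
    new=$(ls -lR "$dir" | sha1sum | sed 's/[ -]//g')
    old=$(cat "$state_file" 2>/dev/null)   # empty on the very first run
    printf '%s\n' "$new" > "$state_file"
    [ "$new" != "$old" ]                   # first run counts as "changed"
}

# Example use (the state file path is a placeholder):
#   if changed_since_last_run "$D" "$HOME/.tree-fingerprint"; then
#       echo "something changed under $D"
#   fi
```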

Cons

  • The method is not 100% reliable because it relies on hashes, which can (in rare cases) collide: the same hash could be generated after a change is made. This shortcoming probably matters only in extremely critical cases, which I have yet to meet in my lifetime.
  • The method can be taxing on very large directories; the hard drive(s) may become very slow while the script is running! I would not recommend applying it to directories with thousands of entries or more, unless nothing else important is using the same drive(s) or it runs at moments when the drive(s) are idle. Either way, it’s in your hands!
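
One way to soften the load is to run the scan at the lowest CPU priority. The sketch below does only that, and the function name is made up. On Linux, ionice from util-linux may additionally lower the I/O priority, but whether it has any effect depends on the I/O scheduler in use, so it is left out here.

```shell
#!/bin/sh
# Hypothetical wrapper: compute the Version 1 fingerprint with find (and the
# stat processes it spawns) running at the lowest CPU priority.
gentle_fingerprint() {
    # -exec ... {} + batches many paths into each stat call,
    # so far fewer processes are started than with a read loop.
    nice -n 19 find "$1" -exec stat -t {} + | sha1sum | sed 's/[ -]//g'
}
```

This does not reduce the total amount of disk work; it only makes the scan yield to other processes more readily.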

Alternative

It would probably be best to “teach” the kernel to do it, because the system calls for operating on the file-system are there anyway. User programs would make a system call to register a path to be monitored. The kernel would insert this path into a look-up table, and the file-system routines would consult the table and set the state variable of the monitored path to a newly generated UUID whenever a file is closed (or similar). User programs could then check this variable through a simple system call. Maybe a kernel module could implement this with hooks?

4 comments

  1. Of course! You would use it like this:

    find "$D" | while IFS= read -r f; do stat -c'%n %s %b %f %u %g %D %i %h %t %T %Y %Z %o' "$f"; done | sha1sum | sed 's/[ -]//g'

    So instead of the -t option, you must use -c'%n %s %b %f %u %g %D %i %h %t %T %Y %Z %o' which is actually the exact same thing but without the %X format parameter which represents the access timestamp.

    Using the -c... options you can customize the script in any way you want, in very fine detail. Just see man stat for a reference on the format parameters.

    Hope this helps! 🙂

  2. Is there a way to modify the command so that it does not use the ‘Accessed’ field when generating the hash? I am asking because I need a way to scan a removable hard disk for new/modified files, but I am not interested in whether they were accessed, since my app will be accessing the files in normal operation.

  3. [Edit] Removed a useless backtick execution of stat (i.e. echo `stat -c%s $f`).
    Without the backticks the command yields the same result, obviously.
