Tech Bits: July 2012

Backing up your data is supposed to be easy: just ctrl-C, ctrl-V the folders you want to backup. In a typical case, there would be 10s of 1000s of files to copy. It would typically take hours to finish the job in that case. That's OK, because I could issue my command, and then get busy in something else, and let my machine do the copying. The problem arises when something goes wrong in between. Sometimes there would be a customary error message which isn't much helpful. If there is none, a crude way to find that out is to compare the sizes of the source and destination folders. You may find that they aren't the same! Someone please tell me what went wrong! Which file couldn't get copied? Did it just give up there and proceed? Where there multiple failures? So, now do I have to do it all over again? How do I know it won't fail this time?

That motivates me to have a script which does it a bit more systematically. In the sense that if something goes wrong, it leaves me with some information about where it went wrong so that I can go and fix it. Thereafter, when I resume my backup process, it shouldn't just start all over again, because that would be such an awful waste of time. Instead, it should just skip the part which has got backed up successfully, and copy those parts of my data which didn't get backed up.

Actually, this is technically called data synchronisation, or more precisely, file synchronisation.

Another important feature I wish it to have is that it doesn't result in proliferation of data in the backup drive. Let me try to briefly explain what this particular problem is and how we intend to handle it.

In the next few paragraphs, I will explain the problem of data synchronisation in some more detail. I will explain the ideas which lead to the algorithm as I have implemented in the script. However, if you are here in search of the script, by all means, feel free to skip over to the source code.

Data Synchronisation

What are we trying to do in backup? As hinted above, we are just trying to create a copy of some data in a source location S, into a destination location D. In the process, we also want that all items that get copied must be related to each other in the same way both in D and S. For example, if there's a file f somewhere in S, it should also be found in D. Additionally, if there's a directory d somewhere in S, with a file f and a directory d', then we expect that in D, we would find a d, with file f and directory d' contained in it.

A good way to look at this logical arrangement of data is the tree data-structure. Treat S and D as the roots of two trees. Each file and directory in S and D are nodes of the respective trees. The files and directories contained in a directory d are its child nodes, and d is their parent. Since directory nodes can have child nodes, they are non-leaf or internal nodes. And since files can't contain anything else, they can't have child nodes. Thus, they are called leaf nodes.

With the above terminology in place, we could rephrase our description of the back process concisely as follows: At the end of the backup, we would like to have the tree rooted at D look exactly the same as that rooted at S.

Declaratively, the process to achieve it is ridiculously simple:

Each file f in S should have a copy f in D too. Each directory d in S should have a copy d in D too.
Let D.d be the copy in D of a directory d in S, denoted likewise by S.d. For every node S.n (whether a file or a directory) which is a child of S.d, its copy D.n should be the child of D.d.
For every file S.f and D.f, D.f should be copied from S.f. This implies that if S.f is found newer than D.f, S.f is copied to D.f. Otherwise, the copying is skipped.

One more step of rephrasing the thoughts, and we have the following:

Any node n in S (again, denoted by S.n) is synchronised if:

There exists D.n.
If S.n happens to be a file, then D.n is newer than S.n.
If S.n happens to be a directory, then all its children are synchronised.

The above way of writing (particularly the last point) brings out a very important recursive structure of the solution. To ensure that a directory is synchronised, all we have to make sure is that all its contents are synchronised. In other words:


proc synchronise(directory d)
    forall(n in d)
        synchronise(n)
    endfor
end proc

The above procedure makes explicit the recursive nature of the synchronisation algorithm. Note that synchronise procedure calls itself as many times as there are nodes in d.

The algorithm to achieve the above post-condition is a bit more complicated. Its implementation in shell-script is significantly more complicated with my knowledge level. It would be very exciting if someone provided a much simpler implementation.

The Script

So, here goes:

#!/bin/bash

TRUE=1;
FALSE=0;

sourceRoot=""; # fill in appropriate value
destinationRoot=""; # fill in appropriate value
backupDirectory=""; # fill in appropriate value

function backupDir(){
 local prefix="$1"; # name of the directory to be backed up.
 local completeName=$(getCompletePathName "${prefix}"); # complete name of the directory to be backed up. 
 if [ ! -d "${completeName}" ]; then
  echo "${completeName} is not a directory. Returning ...";
  return 1;
 fi

 echo "entering directory '${completeName}'";
 local suffix=${completeName#"${sourceRoot}"}; # the name of the directory relative to the sourceRoot.

 if [ ! -d "${destinationRoot}/${suffix}" ]; then # ${destinationRoot}/${suffix} is the complete name of the destination folder
#  echo "creating destination  ${destinationRoot}/${suffix}";
  mkdir "${destinationRoot}/${suffix}"; # there should be a way to check if the make directory succeeded.
 fi
 
 isDirectoryEmpty $completeName;
 local result=$?;
 if [ !  "$result" == "$TRUE" ]; then
  for name in ${completeName}/*; do
 # for name in `find $completeName -maxdepth 1`; do
   if [ "$name" == "$completeName" ]; then
    continue;
   fi
   local inBetween=${suffix%"${name}"};
   isSubdirectory "${completeName}" "${name}";
   local result=$?;
   local filename=`basename "$name"`;
   if [ -d "${name}" ]; then
       backupDir "${name}";
   elif [ -f "${name}" ]; then
    local destinationFileName="${destinationRoot}/${inBetween}/${filename}"
    if [ -f "$destinationFileName" ]; then
     whichFileIsOlder "${name}" "${destinationFileName}";
     local older=$?;
     if [ "${older}" -eq "2" ]; then
#          echo "copying file ${name} to ${destinationFileName}";
          cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded.
#     else
#          echo "No need to copy ${name}.";
     fi
    else
#     echo "copying file ${name} to ${destinationFileName}";
     cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded.
        fi
   else
    echo "Something wrong with $name";
    exit 1;
   fi
  done;
 fi
# echo "exiting directory ${prefix} ...";
}

function whichFileIsOlder() {
 local file1="$1";
 local file2="$2";
 local sdate=`date +%s -r "${file1}"`;
 local ddate=`date +%s -r "${file2}"`;
 local diff=`expr ${ddate} - ${sdate}`;
 if [ "${diff}" -lt "0" ]; then
  return 2;
 else
  return 1;
 fi
}

function isSubdirectory(){
 if [ ! $# == 2 ]; then
  echo "isSubdirectory : function takes 2 parameters; your provided $#.";
  exit 1;
 fi
 local dir1=$(getCompletePathName $1);
 local dir2=$(getCompletePathName $2);

 local l=`expr length "${dir1}"`;
 local p=${dir2:0:${l}};
 if [ "${p}" == "${dir1}" ]; then
  return ${TRUE};  
 fi

 return ${FALSE};
}

function createParentDirectories(){
 local name="$1"; # name  of the directory the directory structure of whose ancestor directories has to be created in the destinationRoot.
 local completeName=$(getCompletePathName "$1"); # complete name of name
 isSubdirectory "${sourceRoot}" "${completeName}";
 local result=$?;
 if [ ! ${result} == ${TRUE} ]; then
  echo "createParentDirectories : Can't create the parent directories because ${name} is not a subdirectory of ${sourceDirectory}.";
  return;
 fi
 local suffix=${completeName#"${sourceRoot}"};
 if [ "${suffix}" == "" ]; then
#  echo "createParentDirectories : No directories to be created.";
  return;
 fi
 local inBetween=${suffix%"${name}"};
 if [ "${inBetween}" == "" ]; then
#  echo "createParentDirectories : No directories to be created.";
  return;
 fi
 local subDirectories=( `echo ${inBetween} | tr "/" "\n"` );
 local dirName="${destinationRoot}";
 for n in ${subDirectories[@]}; do
  dirName="${dirName}/${n}";
  if [ ! -d "${dirName}" ]; then
#   echo "creating directory ${dirName}";
   mkdir "${dirName}";
  fi
 done
}

function getCompletePathName(){
 if [ ! $# == 1 ]; then
  echo "completePathName : function takes 1 parameters; your provided $#.";
 fi
 local name=$1;
 local completeName=$1;
 if [ ! ${name:0:1} == "/" ]; then
  completeName="`pwd`/${name}";
 fi
 echo "${completeName}";
}

function isDirectoryEmpty(){
 if find "$1" -maxdepth 0 -empty | read;
 then
  return $TRUE;
 else
  return $FALSE;
 fi
}

function backup(){
 echo "backupDirectory = $backupDirectory";
 if [ "$backupDirectory" == "" ]; then
  echo "backupDirectory can't be empty. Quitting ..."
  return;
 fi
 if [ "$destinationRoot" == "" ] || [ "$sourceRoot" == "" ]; then
  echo "destinationRoot or sourceRoot variable can't be empty. Quitting ...";
  return;
 fi
 createParentDirectories "${backupDirectory}";
 backupDir "${backupDirectory}";
}

Additional Resources:

rsync Utility (thanks Sumantro)

Tech Bits

Bits of Learning

Tuesday, July 10, 2012

A Script for Backing Up

Data Synchronisation

The Script

Additional Resources:

About Me

Blog Archive