Backing up your data is supposed to be easy: just ctrl-C, ctrl-V the folders you want to backup. In a typical case, there would be 10s of 1000s of files to copy. It would typically take hours to finish the job in that case. That's OK, because I could issue my command, and then get busy in something else, and let my machine do the copying. The problem arises when something goes wrong in between. Sometimes there would be a customary error message which isn't much helpful. If there is none, a crude way to find that out is to compare the sizes of the source and destination folders. You may find that they aren't the same! Someone please tell me what went wrong! Which file couldn't get copied? Did it just give up there and proceed? Where there multiple failures? So, now do I have to do it all over again? How do I know it won't fail this time?
That motivates me to have a script which does it a bit more systematically. In the sense that if something goes wrong, it leaves me with some information about where it went wrong so that I can go and fix it. Thereafter, when I resume my backup process, it shouldn't just start all over again, because that would be such an awful waste of time. Instead, it should just skip the part which has got backed up successfully, and copy those parts of my data which didn't get backed up.
Actually, this is technically called data synchronisation, or more precisely, file synchronisation.
In the next few paragraphs, I will explain the problem of data synchronisation in some more detail. I will explain the ideas which lead to the algorithm as I have implemented in the script. However, if you are here in search of the script, by all means, feel free to skip over to the source code.
A good way to look at this logical arrangement of data is the tree data-structure. Treat S and D as the roots of two trees. Each file and directory in S and D are nodes of the respective trees. The files and directories contained in a directory d are its child nodes, and d is their parent. Since directory nodes can have child nodes, they are non-leaf or internal nodes. And since files can't contain anything else, they can't have child nodes. Thus, they are called leaf nodes.
With the above terminology in place, we could rephrase our description of the back process concisely as follows: At the end of the backup, we would like to have the tree rooted at D look exactly the same as that rooted at S.
Declaratively, the process to achieve it is ridiculously simple:
Another important feature I wish it to have is that it doesn't result in proliferation of data in the backup drive. Let me try to briefly explain what this particular problem is and how we intend to handle it.
In the next few paragraphs, I will explain the problem of data synchronisation in some more detail. I will explain the ideas which lead to the algorithm as I have implemented in the script. However, if you are here in search of the script, by all means, feel free to skip over to the source code.
Data Synchronisation
What are we trying to do in backup? As hinted above, we are just trying to create a copy of some data in a source location S, into a destination location D. In the process, we also want that all items that get copied must be related to each other in the same way both in D and S. For example, if there's a file f somewhere in S, it should also be found in D. Additionally, if there's a directory d somewhere in S, with a file f and a directory d', then we expect that in D, we would find a d, with file f and directory d' contained in it.A good way to look at this logical arrangement of data is the tree data-structure. Treat S and D as the roots of two trees. Each file and directory in S and D are nodes of the respective trees. The files and directories contained in a directory d are its child nodes, and d is their parent. Since directory nodes can have child nodes, they are non-leaf or internal nodes. And since files can't contain anything else, they can't have child nodes. Thus, they are called leaf nodes.
With the above terminology in place, we could rephrase our description of the back process concisely as follows: At the end of the backup, we would like to have the tree rooted at D look exactly the same as that rooted at S.
Declaratively, the process to achieve it is ridiculously simple:
- Each file f in S should have a copy f in D too. Each directory d in S should have a copy d in D too.
- Let D.d be the copy in D of a directory d in S, denoted likewise by S.d. For every node S.n (whether a file or a directory) which is a child of S.d, its copy D.n should be the child of D.d.
- For every file S.f and D.f, D.f should be copied from S.f. This implies that if S.f is found newer than D.f, S.f is copied to D.f. Otherwise, the copying is skipped.
Any node n in S (again, denoted by S.n) is synchronised if:
- There exists D.n.
- If S.n happens to be a file, then D.n is newer than S.n.
- If S.n happens to be a directory, then all its children are synchronised.
proc synchronise(directory d)
forall(n in d)
synchronise(n)
endfor
end proc
The above procedure makes explicit the recursive nature of the synchronisation algorithm. Note that synchronise procedure calls itself as many times as there are nodes in d.
The algorithm to achieve the above post-condition is a bit more complicated. Its implementation in shell-script is significantly more complicated with my knowledge level. It would be very exciting if someone provided a much simpler implementation.
The Script
So, here goes:
#!/bin/bash TRUE=1; FALSE=0; sourceRoot=""; # fill in appropriate value destinationRoot=""; # fill in appropriate value backupDirectory=""; # fill in appropriate value function backupDir(){ local prefix="$1"; # name of the directory to be backed up. local completeName=$(getCompletePathName "${prefix}"); # complete name of the directory to be backed up. if [ ! -d "${completeName}" ]; then echo "${completeName} is not a directory. Returning ..."; return 1; fi echo "entering directory '${completeName}'"; local suffix=${completeName#"${sourceRoot}"}; # the name of the directory relative to the sourceRoot. if [ ! -d "${destinationRoot}/${suffix}" ]; then # ${destinationRoot}/${suffix} is the complete name of the destination folder # echo "creating destination ${destinationRoot}/${suffix}"; mkdir "${destinationRoot}/${suffix}"; # there should be a way to check if the make directory succeeded. fi isDirectoryEmpty $completeName; local result=$?; if [ ! "$result" == "$TRUE" ]; then for name in ${completeName}/*; do # for name in `find $completeName -maxdepth 1`; do if [ "$name" == "$completeName" ]; then continue; fi local inBetween=${suffix%"${name}"}; isSubdirectory "${completeName}" "${name}"; local result=$?; local filename=`basename "$name"`; if [ -d "${name}" ]; then backupDir "${name}"; elif [ -f "${name}" ]; then local destinationFileName="${destinationRoot}/${inBetween}/${filename}" if [ -f "$destinationFileName" ]; then whichFileIsOlder "${name}" "${destinationFileName}"; local older=$?; if [ "${older}" -eq "2" ]; then # echo "copying file ${name} to ${destinationFileName}"; cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded. # else # echo "No need to copy ${name}."; fi else # echo "copying file ${name} to ${destinationFileName}"; cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded. fi else echo "Something wrong with $name"; exit 1; fi done; fi # echo "exiting directory ${prefix} ..."; } function whichFileIsOlder() { local file1="$1"; local file2="$2"; local sdate=`date +%s -r "${file1}"`; local ddate=`date +%s -r "${file2}"`; local diff=`expr ${ddate} - ${sdate}`; if [ "${diff}" -lt "0" ]; then return 2; else return 1; fi } function isSubdirectory(){ if [ ! $# == 2 ]; then echo "isSubdirectory : function takes 2 parameters; your provided $#."; exit 1; fi local dir1=$(getCompletePathName $1); local dir2=$(getCompletePathName $2); local l=`expr length "${dir1}"`; local p=${dir2:0:${l}}; if [ "${p}" == "${dir1}" ]; then return ${TRUE}; fi return ${FALSE}; } function createParentDirectories(){ local name="$1"; # name of the directory the directory structure of whose ancestor directories has to be created in the destinationRoot. local completeName=$(getCompletePathName "$1"); # complete name of name isSubdirectory "${sourceRoot}" "${completeName}"; local result=$?; if [ ! ${result} == ${TRUE} ]; then echo "createParentDirectories : Can't create the parent directories because ${name} is not a subdirectory of ${sourceDirectory}."; return; fi local suffix=${completeName#"${sourceRoot}"}; if [ "${suffix}" == "" ]; then # echo "createParentDirectories : No directories to be created."; return; fi local inBetween=${suffix%"${name}"}; if [ "${inBetween}" == "" ]; then # echo "createParentDirectories : No directories to be created."; return; fi local subDirectories=( `echo ${inBetween} | tr "/" "\n"` ); local dirName="${destinationRoot}"; for n in ${subDirectories[@]}; do dirName="${dirName}/${n}"; if [ ! -d "${dirName}" ]; then # echo "creating directory ${dirName}"; mkdir "${dirName}"; fi done } function getCompletePathName(){ if [ ! $# == 1 ]; then echo "completePathName : function takes 1 parameters; your provided $#."; fi local name=$1; local completeName=$1; if [ ! ${name:0:1} == "/" ]; then completeName="`pwd`/${name}"; fi echo "${completeName}"; } function isDirectoryEmpty(){ if find "$1" -maxdepth 0 -empty | read; then return $TRUE; else return $FALSE; fi } function backup(){ echo "backupDirectory = $backupDirectory"; if [ "$backupDirectory" == "" ]; then echo "backupDirectory can't be empty. Quitting ..." return; fi if [ "$destinationRoot" == "" ] || [ "$sourceRoot" == "" ]; then echo "destinationRoot or sourceRoot variable can't be empty. Quitting ..."; return; fi createParentDirectories "${backupDirectory}"; backupDir "${backupDirectory}"; }