Bits of Learning

Learning sometimes happens in big jumps, but mostly in little tiny steps. I share my baby steps of learning here, mostly on topics around programming, programming languages, software engineering, and computing in general. But occasionally, even on other disciplines of engineering or even science. I mostly learn through examples and doing. And this place is a logbook of my experiences in learning something. You may find several things interesting here: little cute snippets of (hopefully useful) code, a bit of backing theory, and a lot of gyan on how learning can be so much fun.
Showing posts with label backup. Show all posts

Wednesday, March 20, 2013

Backup Script in Python

Some time back I put out a backup script written in bash. After having used it successfully for a while, I decided to rewrite it in Python. Several reasons:
It had bugs. The bugs were known (it's not as if it's a million lines of code). But I couldn't get rid of them. Simply because I could never understand the semantics of the language well enough to write what I wanted to, exactly.
On the other hand, Python has an elegant, readable syntax. Things do what you would expect them to. Error checking is not as minimal as it is in the shell: Python throws up errors when there are any, and you can catch them. Consequently, the Python code that I came up with is smaller and more readable.
My initial experience in using it is that it seems to run quicker too. But I can't vouch for that.

I managed the re-implementation in a couple of hours, as opposed to probably several days' worth of effort (with gaps) that went into the shell script. Part of the reason was of course that the design was absolutely clear in my mind, as I had already done the implementation earlier. So, I don't regret having spent my effort building my first prototype and throwing it away. But another part is (please don't read this sentence if you are a shell-script fan): Python is more modern and better than shell script.

Here are the main features:

  1. It will make the contents under destinationRoot match those under sourceRoot.
  2. It will do selective copying. That is, it will copy a file anywhere inside sourceRoot to the right place inside destinationRoot if and only if there's no corresponding copy in destinationRoot, or the destination copy is found to be older than the source copy.
  3. It allows you to specify any subdirectory of sourceRoot that you want to backup. This is useful when you know that changes since the last backup are localised to a particular location, and you don't want to waste time scanning other locations. For large data, this saves a lot of time.
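
The selective-copy rule in point 2 can be looked at in isolation. What follows is a minimal sketch in Python 3, not part of the script itself: the helper name needs_copy is hypothetical, and it assumes that file modification times are a good enough indicator of which copy is newer.

```python
import os

def needs_copy(source_file, destination_file):
    """Return True if destination_file is missing or older than source_file."""
    if not os.path.exists(destination_file):
        return True
    # Compare modification times (seconds since the epoch).
    return os.path.getmtime(source_file) > os.path.getmtime(destination_file)
```

A file whose destination copy has the same or a later modification time is skipped, which is what makes a repeated backup run quick.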


So, I invite you to prefer this new one to the old one. It does everything that the shell-script did. And it doesn't have the old bugs. So here goes the code:


#!/usr/bin/python

import os
import sys
import shutil

sourceRoot = ""  # fill in appropriate value
destinationRoot = ""  # fill in appropriate value
backupDirectoryName = ""  # fill in appropriate value

# True if dir1 is dir2 itself or lies somewhere under dir2.
# Whole path components are compared, so that /home/ab does not pass
# as a subdirectory of /home/a (a plain prefix test would allow it).
def isSubdirectory(dir1, dir2):
    if dir1 == dir2:
        return True
    prefix = dir2 if dir2.endswith('/') else dir2 + '/'
    return dir1.startswith(prefix)

# returns the name of dirName relative to root ('' if they are the same)
def getRelativeName(dirName, root):
    if not isSubdirectory(dirName, root):
        raise Exception(dirName + ' is not a subdirectory of ' + root)
    if root.endswith('/'):
        return dirName[len(root):]
    return dirName[len(root) + 1:]

# always provide the full path name
def backupDir(dirName):
    # print('going into directory: ' + dirName)
    relativePathName = getRelativeName(dirName, sourceRoot)
    currentDestinationDir = destinationRoot + '/' + relativePathName
    if not os.path.exists(currentDestinationDir):
        os.mkdir(currentDestinationDir)
    for name in os.listdir(dirName):
        sourceName = dirName + '/' + name
        if os.path.isfile(sourceName):
            destinationFileName = currentDestinationDir + '/' + name
            # copy only if the destination copy is missing or older
            if (not os.path.exists(destinationFileName)
                    or os.path.getmtime(sourceName) > os.path.getmtime(destinationFileName)):
                print('copying ' + sourceName + ' to ' + currentDestinationDir + ' ...')
                shutil.copyfile(sourceName, destinationFileName)
            # else:
            #     print('not copying ' + sourceName + ' to ' + currentDestinationDir + ' ...')
        elif os.path.isdir(sourceName):
            backupDir(sourceName)
        else:
            print('Something wrong with ' + name)
    # print('going out of directory: ' + dirName)

# path is a list of directory names relative to root, not an absolute path
def createPath(path, root):
    fullName = root + '/' + '/'.join(path)
    if not os.path.exists(fullName):
        if len(path) > 1:
            # create the missing ancestors first
            createPath(path[:-1], root)
        os.mkdir(fullName)

# given a full/absolute pathname dirName, this returns the path relative to root
# as a list of directory names. getRelativeName raises if dirName is not under root.
def getRelativePath(dirName, root):
    return getRelativeName(dirName, root).split('/')

def backup(s, d):
    global backupDirectoryName
    global sourceRoot
    global destinationRoot

    sourceRoot = s
    destinationRoot = d

    if len(sys.argv) != 2:
        print('usage: backup.py dirname')
        sys.exit(1)

    backupDirectoryName = sys.argv[1]
    if sourceRoot == '':
        print("sourceRoot hasn't been set")
        sys.exit(1)
    if destinationRoot == '':
        print("destinationRoot hasn't been set")
        sys.exit(1)
    if backupDirectoryName == '':
        print("backupDirectoryName hasn't been set")
        sys.exit(1)
    if not os.path.isdir(sourceRoot):
        print("sourceRoot = " + sourceRoot + " doesn't exist.")
        sys.exit(1)
    if not os.path.isdir(destinationRoot):
        print("destinationRoot = " + destinationRoot + " doesn't exist.")
        sys.exit(1)
    if not os.path.isdir(backupDirectoryName):
        print("backupDirectory = " + backupDirectoryName + " doesn't exist.")
        sys.exit(1)
    if not isSubdirectory(backupDirectoryName, sourceRoot):
        print("backupDirectoryName isn't a subdirectory of the sourceRoot.")
        sys.exit(1)
    # make sure the ancestors of the backup directory exist in the destination
    createPath(getRelativePath(backupDirectoryName, sourceRoot), destinationRoot)
    backupDir(backupDirectoryName)

The above code is a sort of library, which you use along with a driver script. An example follows:

#!/usr/bin/python

from backup import *

sourceRoot = '/home/sujitkc/my_work'
destinationRoot = '/home/sujitkc/work_backup'

backup(sourceRoot, destinationRoot)

We usually have multiple backup scenarios. For example, I have the following scenarios:

  1. I take daily (approximately) backup of my work folder at office.
  2. I take fortnightly or monthly backup of my personal data at home.
  3. I take monthly backup of my entire data.

In the above three cases, the only things that change are the sourceRoot and destinationRoot. Therefore, I have a separate driver for each of the above scenarios: backup_work.py, backup_personal.py and backup_all.py respectively. In each of these drivers, I have set the sourceRoot and destinationRoot variables to a different appropriate value. Depending on my backup scenario, I just have to run the corresponding driver script. That's all!
  
To use the script, all you need to do is the following:
  • Save script 1 as backup.py somewhere and make it executable using:
chmod +x backup.py
  • Save script 2 as, say, backup_work.py, at the same location and make it executable using:
chmod +x backup_work.py
  • In script 2, change the values of the sourceRoot and destinationRoot variables to appropriate locations.
  • You are all set! Now run the script as follows:
./backup_work.py your_backup_directory
If you are just starting off with this script and are wondering what to put as your_backup_directory, here's a clue: often, the your_backup_directory parameter has the same value as sourceRoot.


Request

If you find the above script useful, it will be highly appreciated if you drop me a word of acknowledgement as a comment to this post. Feel free to communicate if you find any problem with the code.

Related Post:


A Script for Backing Up

Tuesday, July 10, 2012

A Script for Backing Up

Backing up your data is supposed to be easy: just ctrl-C, ctrl-V the folders you want to back up. In a typical case, there would be tens of thousands of files to copy, and it would take hours to finish the job. That's OK, because I could issue my command, get busy with something else, and let my machine do the copying. The problem arises when something goes wrong in between. Sometimes there would be a customary error message which isn't very helpful. If there is none, a crude way to find out whether something went wrong is to compare the sizes of the source and destination folders. You may find that they aren't the same! Someone please tell me what went wrong! Which file couldn't get copied? Did it just give up there and proceed? Were there multiple failures? So, now do I have to do it all over again? How do I know it won't fail this time?

That motivates me to have a script which does it a bit more systematically. In the sense that if something goes wrong, it leaves me with some information about where it went wrong so that I can go and fix it. Thereafter, when I resume my backup process, it shouldn't just start all over again, because that would be such an awful waste of time. Instead, it should just skip the part which has got backed up successfully, and copy those parts of my data which didn't get backed up.

Actually, this is technically called data synchronisation, or more precisely, file synchronisation.
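
The behaviour asked for above, reporting what failed and skipping what is already backed up, can be sketched as follows. This is a minimal Python 3 sketch under stated assumptions, not the script presented later: the function name copy_pending and the idea of passing in a list of (source, destination) pairs are hypothetical, chosen only to keep the example short.

```python
import os
import shutil

def copy_pending(pairs):
    """Copy each (source, destination) pair that is missing or stale.

    Returns a list of (source, error message) tuples for the copies
    that failed, so that a later run can be pointed at those files.
    """
    failures = []
    for source, destination in pairs:
        try:
            if (os.path.exists(destination)
                    and os.path.getmtime(destination) >= os.path.getmtime(source)):
                continue  # already backed up; nothing to do on a resumed run
            shutil.copyfile(source, destination)
        except OSError as error:
            # Record the failure and carry on with the remaining files.
            failures.append((source, str(error)))
    return failures
```

Running it a second time with the same pairs copies only what is still missing or out of date, and the returned list says exactly which files need attention.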


Another important feature I wish it to have is that it doesn't result in proliferation of data in the backup drive. Let me try to briefly explain what this particular problem is and how we intend to handle it.

In the next few paragraphs, I will explain the problem of data synchronisation in some more detail. I will explain the ideas which lead to the algorithm as I have implemented in the script. However, if you are here in search of the script, by all means, feel free to skip over to the source code. 

Data Synchronisation

What are we trying to do in backup? As hinted above, we are just trying to create a copy of some data in a source location S, into a destination location D. In the process, we also want that all items that get copied must be related to each other in the same way both in D and S. For example, if there's a file f somewhere in S, it should also be found in D. Additionally, if there's a directory d somewhere in S, with a file f and a directory d', then we expect that in D, we would find a d, with file f and directory d' contained in it.

A good way to look at this logical arrangement of data is the tree data structure. Treat S and D as the roots of two trees. Each file and directory in S and D is a node of the respective tree. The files and directories contained in a directory d are its child nodes, and d is their parent. Since directory nodes can have child nodes, they are non-leaf or internal nodes. And since files can't contain anything else, they can't have child nodes. Thus, they are called leaf nodes.
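
To make the tree view concrete, here is a minimal Python 3 sketch that reads a directory into such a tree. The names Node and build_tree are hypothetical, and the sketch assumes that the directory contains no symbolic links that loop back on themselves.

```python
import os

class Node:
    """A file (leaf) or a directory (internal node) in a directory tree."""
    def __init__(self, name, children=None):
        self.name = name
        # None marks a file; a list (possibly empty) marks a directory.
        self.children = children

    def is_leaf(self):
        # Following the terminology above, only files count as leaves.
        return self.children is None

def build_tree(path):
    """Build the Node tree for the file or directory at path."""
    name = os.path.basename(os.path.normpath(path))
    if os.path.isdir(path):
        children = [build_tree(os.path.join(path, entry))
                    for entry in sorted(os.listdir(path))]
        return Node(name, children)
    return Node(name)
```

Calling build_tree on S and on D gives the two trees that the rest of this discussion compares.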

With the above terminology in place, we could rephrase our description of the backup process concisely as follows: At the end of the backup, we would like to have the tree rooted at D look exactly the same as that rooted at S.
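
One way to check this end condition is to compare the sets of relative paths found under the two roots. The following is a minimal Python 3 sketch; the function names relative_entries and same_shape are hypothetical, and it compares only the shape of the trees (names and whether each is a file or a directory), not file contents or timestamps.

```python
import os

def relative_entries(root):
    """Return a set of (relative path, is_directory) pairs found under root."""
    entries = set()
    for current, dirnames, filenames in os.walk(root):
        for name in dirnames:
            full = os.path.join(current, name)
            entries.add((os.path.relpath(full, root), True))
        for name in filenames:
            full = os.path.join(current, name)
            entries.add((os.path.relpath(full, root), False))
    return entries

def same_shape(source_root, destination_root):
    """True if both trees hold the same files and directories by name."""
    return relative_entries(source_root) == relative_entries(destination_root)
```

The set difference relative_entries(S) - relative_entries(D) would list what is still missing from the backup.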

Declaratively, the process to achieve it is ridiculously simple:


  • Each file f in S should have a copy f in D too. Each directory d in S should have a copy d in D too.
  • Let D.d be the copy in D of a directory d in S, denoted likewise by S.d. For every node S.n (whether a file or a directory) which is a child of S.d, its copy D.n should be the child of D.d.
  • For every file S.f and D.f, D.f should be copied from S.f. This implies that if S.f is found newer than D.f, S.f is copied to D.f. Otherwise, the copying is skipped.
One more step of rephrasing the thoughts, and we have the following:
Any node n in S (again, denoted by S.n) is synchronised if:
  • There exists D.n.
  • If S.n happens to be a file, then D.n is not older than S.n.
  • If S.n happens to be a directory, then all its children are synchronised.
The above way of writing (particularly the last point) brings out a very important recursive structure of the solution. To ensure that a directory is synchronised, all we have to make sure is that all its contents are synchronised. In other words:

proc synchronise(directory d)
    forall(n in d)
        synchronise(n)
    endfor
end proc

The above procedure makes explicit the recursive nature of the synchronisation algorithm. Note that the synchronise procedure calls itself once for every node contained directly in d.
The algorithm to achieve the above post-condition is a bit more complicated. Its implementation in shell script is significantly more complicated, at least at my level of knowledge. It would be very exciting if someone provided a much simpler implementation.
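
For comparison, the pseudocode above can be filled out with the file case and written as a minimal Python 3 sketch. This is an illustration only, not the shell implementation that follows: the two-argument form of synchronise is an assumption, and so is the requirement that the parent of the destination already exists.

```python
import os
import shutil

def synchronise(source, destination):
    """Make destination a synchronised copy of source (a file or a directory)."""
    if os.path.isdir(source):
        if not os.path.isdir(destination):
            os.mkdir(destination)
        for name in os.listdir(source):
            # One recursive call per child node, as in the pseudocode above.
            synchronise(os.path.join(source, name),
                        os.path.join(destination, name))
    elif (not os.path.exists(destination)
          or os.path.getmtime(source) > os.path.getmtime(destination)):
        # Base case: copy a file only if its copy is missing or older.
        shutil.copyfile(source, destination)
```

The recursion stops at files, which is the base case that the pseudocode leaves implicit.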

The Script

So, here goes:
#!/bin/bash

TRUE=1;
FALSE=0;

sourceRoot=""; # fill in appropriate value
destinationRoot=""; # fill in appropriate value
backupDirectory=""; # fill in appropriate value

function backupDir(){
 local prefix="$1"; # name of the directory to be backed up.
 local completeName=$(getCompletePathName "${prefix}"); # complete name of the directory to be backed up. 
 if [ ! -d "${completeName}" ]; then
  echo "${completeName} is not a directory. Returning ...";
  return 1;
 fi

 echo "entering directory '${completeName}'";
 local suffix=${completeName#"${sourceRoot}"}; # the name of the directory relative to the sourceRoot.

 if [ ! -d "${destinationRoot}/${suffix}" ]; then # ${destinationRoot}/${suffix} is the complete name of the destination folder
#  echo "creating destination  ${destinationRoot}/${suffix}";
  mkdir "${destinationRoot}/${suffix}"; # there should be a way to check if the make directory succeeded.
 fi
 
 isDirectoryEmpty "$completeName"; # quoted so that names with spaces survive
 local result=$?;
 if [ !  "$result" == "$TRUE" ]; then
  for name in "${completeName}"/*; do
 # for name in `find $completeName -maxdepth 1`; do
   if [ "$name" == "$completeName" ]; then
    continue;
   fi
   local inBetween=${suffix%"${name}"};
   isSubdirectory "${completeName}" "${name}";
   local result=$?;
   local filename=`basename "$name"`;
   if [ -d "${name}" ]; then
       backupDir "${name}";
   elif [ -f "${name}" ]; then
    local destinationFileName="${destinationRoot}/${inBetween}/${filename}"
    if [ -f "$destinationFileName" ]; then
     whichFileIsOlder "${name}" "${destinationFileName}";
     local older=$?;
     if [ "${older}" -eq "2" ]; then
#          echo "copying file ${name} to ${destinationFileName}";
          cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded.
#     else
#          echo "No need to copy ${name}.";
     fi
    else
#     echo "copying file ${name} to ${destinationFileName}";
     cp "${name}" "${destinationRoot}/${inBetween}"; # there should be a way to check if the file copy succeeded.
    fi
   else
    echo "Something wrong with $name";
    exit 1;
   fi
  done;
 fi
# echo "exiting directory ${prefix} ...";
}

function whichFileIsOlder() {
 local file1="$1";
 local file2="$2";
 local sdate=`date +%s -r "${file1}"`;
 local ddate=`date +%s -r "${file2}"`;
 local diff=`expr ${ddate} - ${sdate}`;
 if [ "${diff}" -lt "0" ]; then
  return 2;
 else
  return 1;
 fi
}

function isSubdirectory(){
 if [ ! $# == 2 ]; then
  echo "isSubdirectory : function takes 2 parameters; you provided $#.";
  exit 1;
 fi
 local dir1=$(getCompletePathName "$1");
 local dir2=$(getCompletePathName "$2");

 local l=`expr length "${dir1}"`;
 local p=${dir2:0:${l}};
 if [ "${p}" == "${dir1}" ]; then
  return ${TRUE};  
 fi

 return ${FALSE};
}

function createParentDirectories(){
 local name="$1"; # name  of the directory the directory structure of whose ancestor directories has to be created in the destinationRoot.
 local completeName=$(getCompletePathName "$1"); # complete name of name
 isSubdirectory "${sourceRoot}" "${completeName}";
 local result=$?;
 if [ ! ${result} == ${TRUE} ]; then
  echo "createParentDirectories : Can't create the parent directories because ${name} is not a subdirectory of ${sourceRoot}.";
  return;
 fi
 local suffix=${completeName#"${sourceRoot}"};
 if [ "${suffix}" == "" ]; then
#  echo "createParentDirectories : No directories to be created.";
  return;
 fi
 local inBetween=${suffix%"${name}"};
 if [ "${inBetween}" == "" ]; then
#  echo "createParentDirectories : No directories to be created.";
  return;
 fi
 local subDirectories=( `echo ${inBetween} | tr "/" "\n"` );
 local dirName="${destinationRoot}";
 for n in ${subDirectories[@]}; do
  dirName="${dirName}/${n}";
  if [ ! -d "${dirName}" ]; then
#   echo "creating directory ${dirName}";
   mkdir "${dirName}";
  fi
 done
}

function getCompletePathName(){
 if [ ! $# == 1 ]; then
  # report on stderr: stdout is captured by the callers of this function
  echo "getCompletePathName : function takes 1 parameter; you provided $#." >&2;
 fi
 local name=$1;
 local completeName=$1;
 if [ ! "${name:0:1}" == "/" ]; then
  completeName="`pwd`/${name}";
 fi
 echo "${completeName}";
}

function isDirectoryEmpty(){
 if find "$1" -maxdepth 0 -empty | read;
 then
  return $TRUE;
 else
  return $FALSE;
 fi
}

function backup(){
 echo "backupDirectory = $backupDirectory";
 if [ "$backupDirectory" == "" ]; then
  echo "backupDirectory can't be empty. Quitting ..."
  return;
 fi
 if [ "$destinationRoot" == "" ] || [ "$sourceRoot" == "" ]; then
  echo "destinationRoot or sourceRoot variable can't be empty. Quitting ...";
  return;
 fi
 createParentDirectories "${backupDirectory}";
 backupDir "${backupDirectory}";
}

Additional Resources:

rsync Utility (thanks Sumantro)