DIASER Manual v1.1.1-05 DIASER beta-3 - Technical Manual v 1.1.1-06
17/07/2014

DIASER is for long term digital archive storage. It securely:

1) Accumulates
2) Geo-Duplicates
3) Manages

[Figure: DIASER overview]

Creative Commons License
This work by Damian L Brasher is licensed under a Creative Commons Attribution-Share Alike 2.0 UK: England & Wales License.

Internet Engineering Foundation
Sponsored by IEF - Download Software and White Paper
Alternative SourceForge Download

Fedora 17 to 20 users can install DIASER through the standard package mechanism: yum install diaser

Index

1 Introduction
1.1 Feature overview

2 Explanation of the overall design
2.1 Design philosophy
2.2 The storage architecture
2.3 Integrated approach
2.4 Limitations
2.5 Why Linux?

3 The package and contents
3.1 Downloading and unpacking
3.2 Main source file
3.3 Configuration files
3.4 Example backup software configuration
3.5 Licence
3.6 Documentation

4 Requirements
4.1 Hardware
4.2 Software
4.3 Skills

5 Primary scripts
5.1 diaser
5.2 tab_$.pl
5.3 hvautoc_$.pl and description of rsync options (heavy lifting)
5.4 fill_diaser.pl

6 Explanation of features
6.1 Geographical distribution
6.2 Security
6.3 SE Linux and AppArmor
6.4 Upgrade and modify
6.5 Filling or loading
6.6 Non distinct binary volumes
6.7 Logging
6.8 Archive retrieval
6.9 Data and node migration
6.10 Reporting and monitoring
6.11 Multiple instances
6.12 Extending operation
6.13 Pruning old volumes
6.14 Time zone compensation and leap years
6.15 Digital volume check-sum or stamp
6.16 Complete removal

7 Configuration
7.1 diaser.conf
7.2 Number of years of expected operation
7.3 First year of operation
7.4 Start time of phases
7.5 Node IP addresses
7.6 OpenSSH ports
7.7 Dry run mode
7.8 Lowest maximum bandwidth (LMB)
7.9 Time zone compensation
7.10 Working diaser account name
7.11 Time out
7.12 Home directories
7.13 Fill start time
7.14 Volume directory
7.15 Differential or constant name prefix
7.16 Collect Full volume or not
7.17 Collect Full volume on which day
7.18 Full volume prefix
7.19 More than one configuration file

8 Installation

9 Command Line Options
9.1 --help
9.2 --bandwidth
9.3 --configure
9.4 --extend
9.5 --install
9.6 --list
9.7 --lock
9.8 --logs
9.9 --migrate
9.10 --modify
9.11 --pause
9.12 --recreate
9.13 --remove
9.14 --resume
9.15 --retrieve
9.16 --stats
9.17 --stop
9.18 --upgrade
9.19 --version

10 Operation
10.1 Stop
10.2 Pause
10.3 Resume
10.4 Hard Lock
10.5 Migrate node

11 The Code
11.1 Why Perl?
11.2 Style
11.3 Modules
11.4 Error handling
11.5 Contribute

12 On-line resources
12.1 Website
12.2 SourceForge
12.3 Mailing list
12.4 DIAP/LTASP and early project memory

13 Developers

14 A Personal Installation Scenario - real world example

APPENDIX

A Tables and calculations
B Glossary of terms
C Appliances
D Other secure network data transfer tools
E Useful commands for archiving
F Fedora Project

1 Introduction

DIASER has been created to meet the mid-range and smaller long term archiving requirements of the SME: it is a data vault application. Where tape has been deployed in the past, DIASER now offers an alternative designed to be more robust and manageable in the long term than simple NAS devices or disk based storage alone. This manual is designed to assist the systems administrator by providing a detailed technical overview of the system and its component parts, guidance on planning deployment, installation and storage space calculations, and an overview of the code base and other available resources.

1.1 Feature overview

Cloud based computing has taken off over the last few years. DIASER is an ideal application for cloud computing deployment as well as an archiving framework solution. Once implemented the system is invisible to users but allows them to do more. Cloud computing is a popular term and a useful way of communicating a complex collection of technologies. The use of virtual machines in a distributed environment has many advantages. The problems many people foresee with cloud computing are lock-in, loss of control of data and increased cost of services. DIASER allows an organisation to build private storage clouds using existing resources, as you will see in this technical manual. The result is control over your long term cloud based storage, in terms of administration and resources, as soon as the system is deployed and beyond. This means that data can be migrated when you want, without penalties from a 3rd party provider.

With security in mind at all times, DIASER is based on a carefully designed, robust storage architecture called LTASP, the Long Term Archive Storage Protocol. This means consistency is ensured now and in the future. The design phase involved four years of careful evaluation and testing. DIASER is open source software using the GPL v3 licence model, so users can enjoy the benefits of the open development methodology. Simplicity of design and reuse of code and readily available resources are key to the power of this system. A strong design philosophy has been cultivated and adhered to for the benefit of all users. DIASER is written by a systems administrator for systems administrators, but potential benefits to an SME, its IT manager, CEO and committee have been the highest priority throughout all stages of the design process. The DIASER implementation is targeted primarily at education, hence the name Distributed Internet Archive System for Educational Repositories; however, the system can be downloaded and deployed by any SME. DIASER is designed to be extremely future proof. As an open source product it minimises the risks associated with vendor lock-in and data retrieval.

More features are planned for the future and the most current development road-map can be viewed here: http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV.

2 Explanation of the overall design

2.1 Design philosophy

Archiving and backup are both art and science. Over the years I have been a systems administrator a philosophy has evolved, and I applied it to the design of LTASP and DIASER:

Maximise: Storage capacity, availability of data, data restoration and recovery speed, scalability, modularity, cross-platform deployment, resilience and robustness.
Minimise: Operating bandwidth overhead, impact of network outages, management overheads, support costs.
Simplify: Development cycle, deployment, data recovery, operation, integration with existing systems.

2.2 The storage architecture

Maintaining archives over a number of years requires organisation. For this reason DIASER builds a set of slots (directories) on each node which correspond to dates. This is done in advance, rather than as required, for a number of reasons. The system operates across networks, and network connections can have variable performance or be down completely, so creating a year or more of slots (a slot roughly equates to a single tape) upon installation ensures that the directories are named, and therefore dated, correctly. If data is not copied correctly we can then identify the failure even without log data: log data may or may not be created on a node, but empty slots are indicative of copy or network failure. Computers are not the best time keepers when left to their own devices. If the storage structure is created when all nodes are known to be synchronised then the accuracy of the storage structure is ensured. If slots were created on the fly and node time was not synchronised for any reason (BIOS changes, other software changing the time inadvertently and so on), inaccuracies could occur. The structure is human readable too and, simply put, empty slots are easier to read and parse than missing slots.
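As a rough illustration of why empty slots are easy to parse, the shell function below lists every slot directory that holds no files. It is only a sketch: the function name empty_slots is hypothetical, and it assumes slots sit three levels below an archive root (root/YEAR/mthN/SLOT), following the paths shown elsewhere in this manual.

```shell
# Hypothetical helper, not part of DIASER.
# empty_slots ROOT: print slot directories under ROOT that contain nothing.
# Assumed layout: ROOT/YEAR/mthN/SLOT, e.g. ROOT/2009/mth7/d5
empty_slots() {
    # -mindepth 3 and -maxdepth 3 restrict the search to the slot level,
    # skipping the year and month directories themselves
    find "$1" -mindepth 3 -maxdepth 3 -type d -empty | sort
}
```

Slots dated in the future are empty by design, so only empty slots for dates already passed point to a copy or network failure.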

Storing old archives in a well defined data storage structure is very important. This means DIASER can be deployed with a start date in the past, e.g. 2007 onwards. The system can then be manually filled with old data, like a filing cabinet, and default automatic operation simply continues.

The system is optimised to store a combination of Full and differential volumes: Fulls are created at the beginning of the month and Diffs during the month. However, this does not preclude storage of constant volume sizes, e.g. the storage of CCTV video footage, but calculations must reflect this kind of storage mode. The recommended data vault operation makes use of two directories in each month, Full01 and Full02. Full01 stores a Full volume at the beginning of the month and d1 is skipped. Full02 is there for additional redundancy and to cope with the scenario where the current month is the last (this is not default behaviour).
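The slot names for one month can be sketched as follows. The layout YEAR/mthM/ with d0, one dN per day, Full01 and Full02 is inferred from the paths quoted in this manual; the function name list_month_slots is hypothetical, and the number of days is passed in rather than calculated.

```shell
# Hypothetical helper, not part of DIASER.
# list_month_slots YEAR MONTH DAYS: print one slot path per line.
list_month_slots() {
    # d0 is the entry point held in every monthly segment
    echo "$1/mth$2/d0"
    day=1
    while [ "$day" -le "$3" ]; do
        echo "$1/mth$2/d$day"
        day=$((day + 1))
    done
    # Full volume slots for the month
    echo "$1/mth$2/Full01"
    echo "$1/mth$2/Full02"
}
```

For July 2009, list_month_slots 2009 7 31 prints 34 paths, from 2009/mth7/d0 to 2009/mth7/Full02.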

There are two parts to the architecture: the structure described above and the data transfer mechanism. The data transfers are initiated by an internal structure called the hyper virtual auto-changer, a virtual concept drawn from the mechanical tape changer. The well used tool rsync is a key component of this mechanism and its features are utilised fully. DIASER installs onto three Linux nodes for optimal data storage resilience. No parity is used; this means complete data can be stored and retrieved if a single node is isolated from the others.

DIASER can be managed from any Perl 5.8.8 and network enabled workstation or from a node if preferred.

This section can be skipped and is here for the very technically minded. It takes a deeper look at the architecture; also see section 6.5, Filling or loading. Nodes A and B both contain d0 directories. This structure allows copy phases to span different days simply and accurately; if data were copied directly from node A d5, a midnight crossing of +/-1 day would have to be factored in, depending on the point of reference (node$). Filling of DIASER can then occur well in advance, which keeps the copy phases operationally contained and gives greater control over operation, implementation and readability. The filling occurs outside the LMB calculations and can be at a much slower rate, so the LMB calculations remain applicable to both phases. d0 also acts as a buffer: the original copy still exists if an internal copy fails, and copies can run simultaneously, e.g. A: d0->d5 and B: d0->d0, where otherwise the second copy would have to wait and begin safely only after the first had completed. d0 can be tested for a successful fill before the phases begin.

The concept of a node role helps optimise the architecture: each node behaves differently depending on its role. To allow the roles to be changed in practice, and to simplify fail-over implementation, the directory structure is identical on each node, whether it is node A, B or C. The difference between roles is subtle but important:

Role A: Uses the d0 contained in each month and is designed to be closest to the original backup volume source. Utilised in phase1 only.
Role B: Utilised in phase1 and phase2. Only accepts data during phases.
Role C: Utilised in phase2 only. Only accepts data during phases.

2.3 Integrated approach

DIASER makes use of existing resources where possible. This results in streamlined software tightly integrated with the POSIX/Linux computing environment. Using Perl for this task ensures GNU tools are used instead of re-writing functionality unnecessarily. DIASER uses the common Linux home directory environment, cron, OpenSSH and rsync. Perl is installed by default on most Linux operating systems and only the core is required on the storage nodes. This allows for very simple installation and management. By using user space the system is contained and a layer away from its host root environment, which has many positive implications, not least better security and deployment modularity. DIASER will store backup volumes generated by most backup software products, at least all those that can write volumes to disk, lessening operation, integration and installation overheads. Volumes are defined as resembling a single tape entity.

2.4 Limitations

Storage space is limited by bandwidth. At my reference installation site I spent half an hour with the IT manager deciding the relative importance of the organisation's data. We managed to select about 30% of all data generated on a regular basis and pipe this into DIASER. This practical approach, coupled with compression (data de-duplication may also be available), means that the organisation's critical data is stored using DIASER.

Node A is a single point of failure. This is the node closest, in network terms, to the backup server, and if it fails data will cease to transfer. However, plans exist to allow node A to be bypassed. Even if node A did prevent data transfer, it is expected that the systems administrator has the skills and access to resolve any issues.

2.5 Why Linux?

Linux should not be underestimated for its appropriateness as a storage platform, for many reasons. The cost of obtaining Linux is very low: it is essentially free, both as in libre and free to obtain and use, and supported versions can be very good value too. Linux is widely available and has lightweight resource requirements. Licence issues are avoided. Organisations that need flexibility of deployment with low initial purchase costs get it when they deploy Linux. Linux is extremely robust under most circumstances; for example, the ext3 file system does not normally require regular de-fragmentation, which makes it ideally suited to storage environments. Many of the tools required to enable DIASER are included in standard distributions, even small installations without a GUI or a windowing system. This means DIASER is streamlined and lightweight and does not attempt to needlessly duplicate existing code, e.g. rsync.

3 The package and contents

3.1 Downloading and unpacking

DIASER is currently supplied by anonymous download from SourceForge as diaser-1.0.$.tar.gz (which contains everything in the subversion repository), or as an rpm, dist-tarball or deb package. rpm dependencies are installed automatically by yum. The Makefile allows installation as root: run make, then make install. The deb package still requires extra dependencies. See INSTALL and section 8 of this manual.

3.2 Main source file

diaser - this file unpacks more embedded scripts which are sent to the nodes upon installation, modification and upgrade.

3.3 Configuration files

diaser.conf - this is the main configuration file. See section 7 for configuration guidance. A second configuration file can be created manually for development or second deployments. Keep your configuration files in separate directories or rename them. If no configuration file is present then the default values set in diaser will be used, and this will not lead to a successful deployment.

Also see section 7.19 for use of more than one configuration file.

3.4 Example backup software configuration

helper_scripts/bacula-dir.conf.extract

To fill DIASER with backup volumes created by backup software you need to name volumes in a certain way. This example configuration comes from the open source backup software Bacula. If you use Bacula you can implement volume creation in an identical fashion. If not, use this file as a guide. The script generated by the installer and residing on node A is called fill_diaser.pl. As the name suggests, it collects volumes generated by your backup software, perhaps stored on a share mounted by node A or backed up directly to node A, and fills DIASER with pre-defined named volumes.

If not using Bacula or similar, use of archiving tools is recommended:

tar -czf name_of_archive.tar.gz file_or_dir_to_archive
tar -xvf name_of_archive.tar.gz

The first command creates a compressed tarball of a file or directory and leaves file_or_dir_to_archive in place. Use the second command to open the archive.
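Before handing such a tarball to DIASER it can be worth confirming that it reads back cleanly. The sketch below wraps the tar command above and then lists the archive without extracting it; the function name make_volume is hypothetical.

```shell
# Hypothetical helper, not part of DIASER.
# make_volume NAME SOURCE: create NAME.tar.gz from SOURCE, then verify it.
make_volume() {
    # Create the compressed tarball; SOURCE is left in place
    tar -czf "$1.tar.gz" "$2" || return 1
    # -t lists the contents without extracting; a non-zero exit status
    # means the archive could not be read back
    tar -tzf "$1.tar.gz" > /dev/null
}
```

The function returns a non-zero status if either step fails, so a calling script can refuse to pass on a damaged volume.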

gpg -c archive_file
gpg --output chosen_file_name -d archive_file.gpg

The first command uses a symmetric cipher to encrypt, outputs archive_file.gpg and leaves archive_file as it was. Use the -d switch to decrypt.

zip -7 -P password name_of_archive file_to_archive
unzip name_of_archive

The first command outputs name_of_archive.zip and leaves file_to_archive as it was. Use unzip to decrypt and extract. The password is given in plain text on the command line, so use with care.

3.5 Licence

This software is licensed under GPL V3 - gpl.txt and fdl-1.2.txt. The website is licensed under fdl-1-2.txt

The manual, DiaserSystem.png and DiaserDocsv1.1.pdf are licensed under Creative Commons Attribution-Share Alike 2.0 UK: England & Wales Licence.

3.6 Documentation

Located in the directory docs. This includes this technical manual (docs/manual.txt, .html or .pdf) and a diagrammatic overview (docs/overview.png).

Importantly INSTALL contains a quick start guide.

More theoretical documentation is available from http://www.diap.org.uk and don't forget to check http://www.diaser.org.uk for up to date project news and other information.

A man page is also installed.

4 Requirements

4.1 Hardware

Workstation: 1GHz CPU or above, 500MB RAM and a network connection. You can also use a node as the installation platform but you need to ensure all the Perl modules listed below for the workstation are available.

3 x Linux storage nodes (VMs can be used) with root access for initial setup. Anything above 1GHz, 32bit or 64bit, with 500MB RAM. Enough disk space. I'll make all this much simpler to calculate when I have finished subroutine calculate_lmb, see appendix A, tables and calculations.

LAN or WAN connection between each server and workstation, the 3 machines must be able to, at least notionally, ping one another. Nodes can be connected across a Virtual Private Network if necessary.

4.2 Software

Minimum Perl v5.8.8 enabled (Perl v5.10.0 is recommended for best performance) workstation with Perl modules:

Net::SSH::Perl, Net::SFTP, Getopt::Long, AppConfig, Term::ReadKey and Data::Password. Optional, for the --bandwidth tool: gnuplot v4.2.

Install modules as root, e.g. yum -y install perl-Net-SSH-Perl, or from the cpan shell, install Net::SSH::Perl.

Automatic module installation occurs when installing using the rpm release.

Nodes: Perl core (v5.8.8 or above) with File::Find (installed by default with most distributions), and an SSH server on each node, not necessarily on port 22.

Each node must provide: sshd, cron, iptables with the SSH port open, ntpd and rsync (non-daemon).

4.3 Skills

It is recommended the administrator have at least these skills:

Bash command line - ability to move around directories, create files and directories, set permissions and add and remove user accounts. Knowledge of SSH logins, text editor and adding and removing software. Basic knowledge of rsync and the ability to effectively use scp. Use of commands less and cat. Ability to install Perl modules and check versions.

Less important are some Perl scripting abilities. Basic bash scripting skills may also help.

5 Primary scripts

5.1 diaser

The primary script containing most of the DIASER code. Code embedded within diaser is unpacked and copied to nodes with variables set by the user. For upgrades and configuration changes code is again unpacked and copied over to nodes as required.

5.2 tab_$.pl

There is one for each node. It contains the crontab definitions which trigger the internal diaser data copies managed by the scripts hvautoc_$.pl. The cron job runs every hour, e.g. 0 * * * * ~/hvautoc_a.pl. The script reads the local system time, compares it with the user-set copy phase times and, if there is a match, initiates data transfer. The script logs to the node, in log_$, as does rsync.

5.3 hvautoc_$.pl

Each node has a single hvautoc_$.pl script. The script is triggered every hour and, depending on the times set in the user variables HOUR1 and HOUR2, initiates the rsync data transfers. If the user modifies variables then these updates can be copied to the nodes by replacing the hvautoc_$.pl scripts.
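The hourly decision can be sketched in shell. This is not the Perl used by hvautoc_$.pl, only an illustration of the comparison, and it assumes HOUR1 and HOUR2 are whole hours from 0 to 23. The function names strip0 and phase_for_hour are hypothetical.

```shell
# Hypothetical helpers, not part of DIASER.
# strip0 HOUR: drop one leading zero so "08" is read as 8, "00" as 0.
strip0() {
    v=${1#0}
    echo "${v:-0}"
}

# phase_for_hour CURRENT HOUR1 HOUR2: print phase1, phase2 or none.
phase_for_hour() {
    now=$(strip0 "$1")
    if [ "$now" -eq "$(strip0 "$2")" ]; then
        echo phase1
    elif [ "$now" -eq "$(strip0 "$3")" ]; then
        echo phase2
    else
        echo none
    fi
}
```

Called from an hourly cron job as phase_for_hour "$(date +%H)" 2 4, this prints phase1 during hour 02, phase2 during hour 04 and none at any other time.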

A description of rsync options used for heavy lifting

rsync -anzv -e "ssh -p PORT_NUM" --timeout=SECONDS --numeric-ids --stats --ignore-errors --bwlimit=KBPS /start/dir/ diaser_acc@node_x:/destination/dir/ >> log_file

options

-a = -rlptgoD (not -H,-A,-X i.e. no hard link preservation, ACLs or extended attributes)
  -r = recursive operation
  -l = symlinks kept as symlinks
  -p = preserve all permissions
  -t = preserve modification times
  -g = preserve group ownership
  -o = preserve ownership (super-user only)
  -D = transfer character and block device files & special files

-n = (dry run) not used for normal operational status
-z = compress data during transfers
-v = increase verbosity - the v level can be adjusted
-e = remote shell command
-p = remote shell port (an ssh option, given inside the -e string, not the rsync -p above)

--timeout=SECONDS
--numeric-ids = don't map uid/gid values by user/group name
--stats = generate file-transfer stats
--ignore-errors = instructs --delete to go ahead and delete files even when there are I/O errors
--bwlimit=KBPS = this is the maximum transfer rate in kilobytes per second
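To show how the dry run flag and the per-node settings fit into the command line above, the sketch below prints the option string for inspection without running rsync. The function name rsync_options is hypothetical, and treating a DRY_RUN value of 1 as "on" is an assumption about how the setting is encoded.

```shell
# Hypothetical helper, not part of DIASER.
# rsync_options DRY_RUN PORT TIMEOUT BWLIMIT: print the rsync options.
rsync_options() {
    # Assumption: DRY_RUN is 1 when dry run mode is on
    if [ "$1" = "1" ]; then
        flags="-anzv"   # -n added: report what would be copied, copy nothing
    else
        flags="-azv"
    fi
    echo "$flags -e 'ssh -p $2' --timeout=$3 --numeric-ids --stats --ignore-errors --bwlimit=$4"
}
```

For example, rsync_options 1 2222 300 12500 prints the dry run form with SSH on port 2222, a 300 second timeout and a 12500 KBytes per second limit.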

5.4 fill_diaser.pl

This script resides only on node A. It is responsible for filling the correct slot with data fed into DIASER by the user. The script is called by a cron job which is set when configuring or modifying DIASER. It copies the most recently created Full, Differential or constant volume to the DIASER directory, into either Full01 or d0. Aside from the cron job time there are a number of variables that can be configured by the user, including the volume directory (where your backup software stores volumes) and the volume prefix, e.g. fullbackup... for Full volumes.

Filling is designed to be as simple as possible. Volumes on your file store are assumed to be read/write by user id: $your_diaser_uid. This flow chart provides a detailed overview of the fill process, everything apart from the node A->B copy check has been implemented:

fill diaser flow chart



fill_diaser.pl automatically clears out the drop off directory ad0 once its contents would normally have been transferred to other slots as specified by the architecture.

6 Explanation of features

6.1 Geographical distribution

Tapes can be moved from site to site and often are. Distributing data emulates this ability and provides geographical redundancy. A simple mirror of a NAS device is one way to achieve this, but spreading data over three nodes can be difficult to manage. DIASER is a self contained wrapper around long term archiving across three nodes. We believe the extra resilience provided by storing in three geographical locations gives your archives the protection needed for long term planning and data retrieval. Ensuring your archives are safe means a better chance of recovering data when you need it. Being a disk based solution will help render your data more accessible in many scenarios. Planning your installation is important and, as the system may run for years, spending time before deployment will pay off. DIASER is ready for trial and evaluation. Your chosen storage nodes may also be equipped with RAID. This is highly recommended.

6.2 Security

These security precautions have been implemented. The primary script, diaser, does not store any passwords on file; passwords are held in memory temporarily while the script runs. When a password is requested the entry is hidden. New DIASER account passwords are quality checked and a warning is given if they are not secure. Root passwords are only requested when the system is installed and removed. DIASER exists and runs in user space. All network communication is handled by OpenSSH. A unique RSA certificate is generated so the nodes can use password-less logins to transfer data and communicate during normal operation. Password-less login certificates can be regenerated using the --upgrade switch. A kind of emergency account lock can be initiated with the switch --lock.

The Perl modules Net::SSH::Perl and Net::SFTP are used for all SSH communications and file transfers initiated by the system. Rsync uses SSH to transfer data. It is possible to use a different port to the standard SSH port 22 and to set this individually for each node.

A sha256sum checksum and date stamp file is created as every volume enters DIASER, in a format similar to:

4865c5bdf3cf64709acd797688db5b337e7c8643
2009/mth7/Full01/fullbackup7
Tue Jul 21 07:10:28 BST 2009
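A stamp in this three line shape can be sketched with standard tools. The function name make_stamp is hypothetical and the exact file DIASER writes may differ. Note that sha256sum prints a 64 character digest, whereas the sample digest above is 40 characters long, the length of a SHA-1 digest.

```shell
# Hypothetical helper, not part of DIASER.
# make_stamp FILE SLOT_PATH: print digest, slot path and date, one per line.
make_stamp() {
    # sha256sum prints "<digest>  <file name>"; keep only the digest
    sha256sum "$1" | cut -d ' ' -f 1
    # Where the volume is stored inside DIASER
    echo "$2"
    # Local date and time at which the stamp was made
    date
}
```

The output could be redirected to a file stored alongside the volume, for example make_stamp fullbackup7 2009/mth7/Full01/fullbackup7 > fullbackup7.stamp, where the .stamp name is only an example.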

For extra security DIASER can run within a Virtual Private Network. It is recommended that encrypted partitions are used for DIASER, e.g. when deploying an external USB hard drive.

In the following example /dev/sdb is an externally attached USB2 hard disk drive; replace it with the disk chosen on your system.

# Create a new partition on the disk

fdisk /dev/sdb

# Generate a mapping and LUKS partition

cryptsetup --verbose --verify-passphrase luksFormat /dev/sdb1

cryptsetup luksOpen /dev/sdb1 sdb1

# Format the partition

mkfs.ext3 -j /dev/mapper/sdb1 

# Mount the partition for the first time

mount /dev/mapper/sdb1 /mnt/crypt/

df -h

# Open and mount the device after reboot or disk removal

cryptsetup luksOpen /dev/sdb1 sdb1

mount /dev/mapper/sdb1 /mnt/crypt/

# Unmount and close

umount /mnt/crypt/
cryptsetup luksClose sdb1

6.3 SE Linux and AppArmor

No problems observed during either installation or operation.

6.4 Upgrade and modify

Currently the modify switch, see below, is still under review. For now the upgrade switch sends modifications and upgrades to the nodes. This does not and will not modify the archive storage directory structure. Changes to settings and development improvements can be sent using this option. If you move to a newer version than your previous one, follow these steps:

1) rename your current diaser_rel
2) unpack the download, see section 3.1
3) copy your previous diaser.conf to the new diaser_rel
4) run diaser --upgrade to update your DIASER installation

6.5 Filling or loading

See section 5.4.

The initial entry point for data, d0 (node A, directory 0), resides in each monthly segment; there is not a single d0 in the root directory. This lessens the risk of deleting or overwriting archive data that may not, for whatever reason, have been successfully transferred to the other nodes. If the connection to node B fails there will be at least two copies of the file, in d0 and in d30 (or whatever the last day of the month happens to be), before another Full is generated and the next month's d0 is cleared and filled. This adds more resilience at little extra cost. With a single d0, by contrast, if copies were set to occur only once a month, a copy failed as above and the failure was not noticed until after the next copy, last month's data would have been deleted and only a single copy stored.

6.6 Non distinct binary volumes

The volumes which have been described are binary files, like those created by Bacula. Other backup software generates directories which need some processing before they can be collected by DIASER.

There are a number of problems to avoid to ensure DIASER operates non-destructively, so instead of manipulating the directories in your data store I suggest you use a script to create tar volumes of the archives you want to be collected. Here is a pseudo code suggestion of how this might be achieved.

    # non distinct binary volume alternative collection
    # run as a cron job independently of DIASER

    sub non_full_binary {
        # look for directories in the data store
        ls
        if ($directories) {

            # check for a previous tar Full
            # -> if there is no Full this month then tar/shasum/date
            #    any directories collected for Full -> Full01 slot and
            #    name with the chosen Full volumes prefix.

            # check for a previous tar Diff
            # -> if there is a Full this month then create a
            #    tar/shasum/date Diff against it for the day slot and
            #    name with the chosen Diff volumes prefix.
        }
    }
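The Full or Diff decision in the pseudo code can be sketched in shell as below. It only decides which volume type is due and creates no archives; the function name volume_type is hypothetical, and it assumes a Full volume for the month, if present, sits in the Full01 slot with the chosen Full prefix.

```shell
# Hypothetical helper, not part of DIASER.
# volume_type MONTH_DIR FULL_PREFIX: print Full or Diff.
volume_type() {
    # Look for any existing volume with the Full prefix in Full01.
    # If the pattern matches nothing it stays literal and -e fails.
    for f in "$1"/Full01/"$2"*; do
        if [ -e "$f" ]; then
            echo Diff    # a Full already exists this month
            return 0
        fi
    done
    echo Full            # no Full yet, so one is due
}
```

For example, volume_type ~/diaser/2009/mth7 fullbackup prints Full until a file such as fullbackup7 appears in Full01, and Diff after that.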

6.7 Logging

Log files are kept on all nodes and named log_$ where $ is the node: a, b or c. The scripts hvautoc_$.pl, fill_diaser.pl and all rsync transfers log to these files. The log files are created automatically as soon as the system begins operation. All entries contain [diaser_hvautoc_$] or [diaser_fill] where $ is the node: a, b or c.

6.8 Archive retrieval

Either use the simple tool provided by the --retrieve option, which also has additional command line options, or log in to the nodes directly and use scp. The retrieval tool will walk you through a set of questions then list files for you to pick and transfer. The file will retain its name and be placed in the diaser_rel directory.

If using cp, scp, rsync or other native tools, the directory structure is human readable and matching the required date to a directory is easy, e.g. on node $ the archives stored on June 25th 2009 can be found in ../diaser/2009/mth6/d26. Navigate to the directory and copy the contents to the required recovery destination. It is assumed you have the tools, provided by your backup software vendor, to extract your data. It is recommended you also archive any backup catalogues or tools generated by or provided with your usual backup software.

6.9 Data and node migration

Node migration can be achieved using the --migrate tool.

6.10 Reporting and monitoring

Bandwidth throughput calculations can be made using the --bandwidth tool. See section 9.2 for more details. This is an example screenshot of the output:

[Figure: example output of the --bandwidth tool]

6.11 Multiple instances

Share disk space with other organisations or groups by using a different account name and staggering or alternating the transfer times (phases) or lowering the LMB - lowest maximum bandwidth between nodes. See diaser.conf. diaser will allow the use of more than one configuration file. See section 7.19.

Also, if more than one pair of phases is required, e.g. a morning session and a night session, then two instances on the same nodes will archive at alternative phase times. If one instance contains Full volumes then the second does not necessarily need to archive these as well, thus saving disk space.

6.12 Extending operation

Operation can be extended. The minimum recommended is two years. You can set DIASER to install for 10 or even 20 years, which means 10-20 years of archive directory structure will be created. A deployment can start in the past if required and then be filled manually with previously generated archive data.

6.13 Pruning old volumes

Not yet implemented. This will allow the user to remove old archives from DIASER freeing up disk space.

6.14 Time zone compensation and leap years

Time zone compensation allows all the nodes to work together across time zones. The user is asked for the time zone of nodes A, B and C as UTC +/- an integer value. If node A is on BST (UTC+1), use 0, as daylight saving is usually handled automatically on most systems. For three servers in the same time zone use the same offset integer value for each node.
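The arithmetic behind an offset can be illustrated as follows. This is not DIASER's own algorithm, only a sketch of converting an hour on one node to the equivalent local hour on another, given each node's UTC offset as an integer. The function name convert_hour is hypothetical.

```shell
# Hypothetical helper, not part of DIASER.
# convert_hour HOUR FROM_OFFSET TO_OFFSET: print the equivalent local hour.
convert_hour() {
    # Remove the source offset, apply the target offset, then wrap into
    # 0..23; adding 24 before the final modulo keeps the result positive
    echo $(( (($1 - $2 + $3) % 24 + 24) % 24 ))
}
```

For example, convert_hour 0 1 0 prints 23: midnight on a UTC+1 node is 23:00 the previous day on a UTC node.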

The scripts hvautoc_$.pl all contain an algorithm that will ensure proper interpretation of leap year occurrences.
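For reference, the standard Gregorian leap year rule can be written in a few lines of shell. This is the general rule, not a copy of the code in hvautoc_$.pl, and the function names are hypothetical.

```shell
# Hypothetical helpers, not part of DIASER.
# is_leap_year YEAR: succeed if YEAR is a leap year.
is_leap_year() {
    # Divisible by 4, and either not a century or divisible by 400
    [ $(( $1 % 4 )) -eq 0 ] && { [ $(( $1 % 100 )) -ne 0 ] || [ $(( $1 % 400 )) -eq 0 ]; }
}

# days_in_month YEAR MONTH: print the number of days (MONTH is 1-12).
days_in_month() {
    case "$2" in
        4|6|9|11) echo 30 ;;
        2) if is_leap_year "$1"; then echo 29; else echo 28; fi ;;
        *) echo 31 ;;
    esac
}
```

The result decides the last dN slot of a month, e.g. days_in_month 2008 2 prints 29, so February 2008 ends at d29.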

6.15 Digital volume check-sum or stamp

A unique check-sum or stamp and a date stamp are generated as a volume enters DIASER and stored alongside the volume. See section 6.2 for an example.

6.16 Complete removal

This will completely remove all DIASER components and all archive data stored within the system. Data recovery is not possible after this operation has been performed.

7 Configuration

7.1 diaser.conf

The supplied configuration can be adjusted to suit your deployment requirements. Each parameter name is in upper case and must not be changed. Change the value to the right of each parameter, keeping a space in between. The default values are there to guide your choice, e.g. the default NODE_A 0.0.0.0 might become NODE_A 192.168.2.1. Use the same case and value type for your chosen values as the defaults.
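To check a value quickly from the command line, a "PARAMETER value" file of this shape can be read with awk. This is a sketch only: the function name conf_get is hypothetical, diaser itself reads the file in Perl, and a value containing spaces would be cut at the first space.

```shell
# Hypothetical helper, not part of DIASER.
# conf_get FILE PARAMETER: print the value set for PARAMETER in FILE.
conf_get() {
    # Match lines whose first field is exactly the parameter name and
    # print the second field; stop at the first match
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}
```

For example, conf_get diaser.conf NODE_A prints 192.168.2.1 if the file contains the line NODE_A 192.168.2.1.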

7.2 Number of years of expected operation

NUM_YEARS

The minimum recommended is 2; the default is 3.

7.3 First year of operation

START_YEAR

This is the year when DIASER begins operation. Would usually be the current year.

7.4 Start time of phases

HOUR1

HOUR2

DIASER operates in two phases: phase one starts at HOUR1 and phase two at HOUR2. The phases can be set to any time within a 24 hour period. It is assumed that the start time is based on your local time zone, e.g. BST (UTC+1). It is recommended to set the phases to early in the morning to avoid using daytime bandwidth resources. Once set, the phases run at the same time every day; they can be reset by sending a new configuration from diaser. Using two phases optimises the use of resources when transferring internally on a node and between nodes, prevents simultaneous transfers from interfering with each other, and simplifies the management and tracking of transfers.

7.5 Node IP addresses

NODE_A

NODE_B

NODE_C

7.6 OpenSSH ports

PORT_A

PORT_B

PORT_C

Set these if your nodes' OpenSSH servers do not listen on the default port 22.

7.7 Dry run mode

DRY_RUN

Copies are initiated but no archive data is transferred. This can be used for testing, debugging and trials.

This can be toggled at any time and the new setting sent to the nodes, as for all settings in this section.

7.8 Lowest maximum bandwidth (LMB)

LOW_MAX_BW

Bandwidth control. Enter the maximum speed, in KBytes per second, of your slowest network connection between A->B, B->C or C->B. I recommend you run some test transfers between nodes using scp. Don't assume the bandwidth will remain constant throughout the cycle; you may need to run some long term viability tests. This feature will be implemented automatically with the subroutine calculate_lmb(). Adjust the value if you install more than one diaser instance on a single disk or machine. The default is 12500 KBytes per second (100 Mbits per second).
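
The default follows from a simple conversion: 1 Mbit per second is 1000 Kbits per second, or 125 KBytes per second. The sketch below shows that arithmetic and the choice of the slowest link. The function names are hypothetical, and this is not the calculate_lmb() subroutine mentioned above.

```shell
# mbit_to_kbytes MBIT_PER_S -> decimal KBytes per second
# 1 Mbit/s = 1000 Kbit/s = 125 KBytes/s
mbit_to_kbytes() {
    echo $(( $1 * 125 ))
}

# lowest_max_bw BW... -> the smallest of the measured link maxima
lowest_max_bw() {
    lowest=$1
    for bw in "$@"; do
        if [ "$bw" -lt "$lowest" ]; then
            lowest=$bw
        fi
    done
    echo "$lowest"
}

mbit_to_kbytes 100              # prints 12500, the LOW_MAX_BW default
lowest_max_bw 12500 1250 6000   # prints 1250
```

In the second call the three figures stand for measured maxima of the three inter-node links; the value to enter as LOW_MAX_BW is the smallest of them.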

7.9 Time zone compensation

For deployments that span different time zones. Enter a UTC +/- integer offset for each of nodes A, B and C; for example, if node A is on BST (UTC+1), use 1.

TZONE_A

TZONE_B

TZONE_C

7.10 Working diaser account name

USER_ACC

Choose a name for your DIASER user accounts. The same name will be used on all three nodes. Limit this to between 5 and 10 lower case characters for simplicity; I use diasertest, for example.

7.11 Time out

TOUT

The copy timeout, in seconds, used by rsync for transfers. Set this lower than the phase periods.
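
Several of the settings in this section correspond naturally to standard rsync options. The sketch below only prints a command line to show one plausible mapping: TOUT to --timeout, LOW_MAX_BW to --bwlimit, the OpenSSH port to -e 'ssh -p PORT' and DRY_RUN to --dry-run. This mapping, the function name and the example paths are assumptions; the options DIASER really passes are those described in section 5.3. Note that rsync's --timeout is an I/O timeout and that --bwlimit without a suffix is counted in units of 1024 bytes per second.

```shell
# build_rsync_cmd TOUT LOW_MAX_BW PORT DRY_RUN SRC DEST
# Prints (does not run) an rsync command line built from the settings.
build_rsync_cmd() {
    opts="-a --timeout=$1 --bwlimit=$2"
    if [ "$4" -eq 1 ]; then
        opts="$opts --dry-run"
    fi
    echo "rsync $opts -e 'ssh -p $3' $5 $6"
}

# Hypothetical source and destination paths, for illustration only.
build_rsync_cmd 10900 12500 22 1 /home/diasertest/out/ \
    diasertest@192.168.2.1:/home/diasertest/in/
```

Printing the command first is a cheap way to check a new configuration before any data moves.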

7.12 Home directories

DIR_A

DIR_B

DIR_C

The home directory location of the diaser account. You may need to adjust this if your large partition is not in the usual home directory place, e.g. /mnt/big/ will evaluate as /mnt/big/diaser.

7.13 FILL_START_TIME

The time at which the daily filling script is started. This should be set in advance of the DIASER archive transfer phases to ensure DIASER is filled before the phases begin.

7.14 VOLUME_DIR

The location of the volume storage directory, where you store the backup volumes created by your backup software.

7.15 DIFF_CONST_PREFIX

Differential or constant volume name prefix.

7.16 COLLECT_FULL

Choose whether full volumes are collected or not. Turn this off if you simply want to collect constant sized volumes, like CCTV footage.

7.17 COLLECT_FULL_DAY

Day of the month when full volumes are collected.

7.18 FULL_PREFIX

Full volume name prefix.

7.19 More than one configuration file

It is possible to force diaser to read a particular configuration file by executing ]$diaser diaser.conf --opts

The configuration file can be named as the user chooses, e.g.

]$diaser my.config --opts

Currently, changes will always be written to diaser.conf in the directory diaser was executed from. The user is free to change the name of the configuration file and read it into diaser as described above. This feature is particularly useful when there is more than one installation being managed from a single user account.

8 Installation

]$./diaser --install

Run this as a normal user after you have configured diaser.conf. You will be informed as each task is completed. At the end of installation you will need, one time only, to log in from the diaser account on each node to each of the other nodes to accept the certificates between nodes, as you would the first time you SSH into a box: A->B, A->C, B->A, B->C, C->A and C->B. Afterwards, logins between nodes are password-less; this step allows DIASER to begin work. This step may be automated depending on user feedback.
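
The six first-time logins are simply every ordered pair of distinct nodes. As a small sketch, the hypothetical helper below prints them for a given account name and three node addresses; it only prints the ssh commands and does not run them. The account name and addresses used in the example are those of the test appliance in Appendix C.

```shell
# list_first_logins USER NODE_A NODE_B NODE_C
# Prints one line per ordered pair of distinct nodes.
list_first_logins() (
    user=$1
    shift
    for from in "$@"; do
        for to in "$@"; do
            if [ "$from" != "$to" ]; then
                echo "on $from run: ssh $user@$to"
            fi
        done
    done
)

list_first_logins diasertest 10.20.0.1 10.20.0.2 10.20.0.3
```

If your nodes use a non-default OpenSSH port, add -p followed by the relevant PORT_A, PORT_B or PORT_C value to each ssh command.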

9 Command Line Options

Please note, not all of these operations have been implemented. Please view the most current development road-map: http://diaser.svn.sourceforge.net/viewvc/diaser/ROADMAP_DEV. As such some of these items may change or be removed altogether or others added. Later in the development cycle I plan to extend command line options so configuration changes can be set using the diaser command.

Run all commands from a prompt as a normal user, i.e.

]$diaser --install

9.1 --help

Display menu and command line options.
 DIASER Usage: diaser_setup.pl 

    --help                 help|-?

    --bandwidth            calculate real bandwidth throughput between nodeX-Y
    --configure            question driven configuration tool
    --extend               extend maximum storage structure date 
    --install              install
    --list                 list all volumes in storage
    --lock                 lock all DIASER node accounts
    --logs                 condensed log readings from nodes
    --migrate              migrate node 
    --modify   [opts]      send modified configuration to nodes either
                           from conf file or command options or both
    --pause                pause operation 
    --recreate             recreate a single node from scratch
    --remove               remove from nodes, all data will be lost 
    --resume               resume operation 
    --retrieve [opts]      retrieve archive data 
    --stats                generate statistics 
    --stop                 stop operation 
    --upgrade              apply upgrades    
    --version              show version  
                                              
 For more information please use man diaser or the more detailed
 online manual: http://diaser.org.uk/manual.html

 Please send any FEEDBACK to dbrasher@interlinux.co.uk.
 I'm especially interested in how DIASER may be of use to you now or in the future.
 Thank you.

9.2 --bandwidth

This option will allow you to view the real, not theoretical, data throughput between two of your chosen storage nodes. You will need to have the open source tool gnuplot installed on the system from which you are running this application.

This tool will attempt to download and compile the binary NPtcp from the NetPIPE utility suite: http://bitspjoule.org/netpipe/. The tool operates over port 5002 and stats will be collected from the sender.

9.3 --configure

Question driven configuration tool, with input validation, for new and existing diaser deployments.

9.4 --extend

Extend maximum storage structure beyond the currently installed year.

9.5 --install

Install DIASER. See section 8, Installation, above.

9.6 --list

This option lists all volumes stored in DIASER.

9.7 --lock

Lock all DIASER node accounts. The systems administrator will need to reset the passwords for each diaser user account manually.

9.8 --logs

Condensed log readings from nodes.

9.9 --migrate

Migrate node to a different server.

9.10 --modify

Apply modified settings to the running DIASER on your designated nodes. Any changed settings will also be written to diaser.conf.

9.11 --pause

Pause any currently running data transfers on all nodes. Sends kill -STOP.

9.12 --recreate

In case you need to rebuild a node. You should only need to rebuild a node in the event of a disk failure or other non-recoverable node loss. In all other cases please consider using the --migrate (node) option.

--numyear           years of operation required
--startyear         year to begin storing archives, this can be in the past
--phase1            hour between 0 and 23
--phase2            hour between 0 and 23
--nodea             ip address in format 0.0.0.0
--nodeb             ip address in format 0.0.0.0
--nodec             ip address in format 0.0.0.0
--dryrun            boolean 1(y) or 0(n)
--lmb               lowest maximum bandwidth, KBytes per second
--tzone             [not yet implemented]
--tout              copy time out in seconds
--fillstarttime     time to run DIASER fill operation, hour between 0 and 23
--volumedir         the directory where your backup volumes reside
--diffconstprefix   prefix given to your Differential or constant volumes
--collectfull       are Full volumes to be collected or not, boolean 1(y) or 0(n)
--fullprefix        prefix given to your Full volumes

9.13 --remove

Completely remove DIASER from your previously designated nodes. Please use with caution as all archive data stored in DIASER will be permanently deleted.

9.14 --resume

Resume paused data transfers. Sends kill -CONT.

9.15 --retrieve

Fetch archived data volumes.

A simple tool is provided, which also has additional command line options. The retrieval tool will walk you through a set of questions, then list files for you to pick and transfer. The file will retain its name and be located in the diaser_rel directory.

--r_year    which year
--r_month   which month
--r_day     which day
--r_full    if not a day name a full directory - leave as default
--nodea     ip address in format 0.0.0.0
--nodeb     ip address in format 0.0.0.0
--nodec     ip address in format 0.0.0.0
--porta     int 
--portb     int
--portc     int
--user_acc  user account name, usually default set previously

9.16 --stats

Displays, for each node and in GiB: disk space, total daily volumes, total full volumes, total data stored and average differential volume size.

9.17 --stop

Discontinue data transfers. Sends kill -9.

9.18 --upgrade

Apply product upgrades to existing nodes with a DIASER installation. Your DIASER account password will be requested.

9.19 --version

Show the current DIASER version and the currently installed Perl version.

10 Operation

10.1 Stop

This option will stop DIASER copies currently in operation, until the next set of transfer operations are initiated. This will kill any rsync processes.

10.2 Pause

This option will pause DIASER copies currently in operation, until the resume option is used.

10.3 Resume

This option will resume DIASER copies that have been paused.
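
Stop, pause and resume rely on the standard signals listed in section 9: kill -9, kill -STOP and kill -CONT. The sketch below demonstrates the three signals on a harmless sleep process standing in for a transfer. The function names are hypothetical, and how diaser selects the processes to signal on each node is not shown here.

```shell
# Each helper takes the process ID of a running transfer.
pause_transfer()  { kill -STOP "$1"; }   # suspend; the process stays in memory
resume_transfer() { kill -CONT "$1"; }   # continue a suspended process
stop_transfer()   { kill -9 "$1"; }      # terminate immediately

# A sleep process stands in for an rsync transfer.
sleep 300 >/dev/null 2>&1 &
pid=$!

pause_transfer "$pid"
resume_transfer "$pid"
stop_transfer "$pid"
wait "$pid" 2>/dev/null || echo "stand-in transfer ended with status $?"
```

A paused process keeps its open files and network connection, so a long pause may still end in a timeout on the remote side, whereas a stopped transfer is simply picked up again by the next set of transfer operations.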

10.4 Hard Lock

Lock all DIASER node accounts. This is a security feature that enables an operator with root access to lock all DIASER node accounts immediately. Access from node to node, and hence operation, will resume only after logging in to the nodes as root and re-enabling the DIASER account password.

10.5 Migrate node

Migrate will assist you in moving an existing node from the current machine, server or workstation, to a new one. This may be located anywhere as long as it satisfies the requirements for DIASER inter-node-visibility. The procedure may take anywhere from minutes to hours depending on the amount of data stored on the existing node and network bandwidth available.

11 The Code

11.1 Why Perl?

The language is very well suited to Linux and POSIX environments. It is well supported and has good network programming capabilities. Perl is very flexible and allows a simple yet robust coding environment. Its cross platform properties are extremely valuable and ensure the code base is portable. Perl's inherent text parsing abilities are also valuable and set the language apart from many other contenders.

11.2 Style

Style is based as much as possible on the excellent O'Reilly book Perl Best Practices by Damian Conway. A modular approach is used to code DIASER. All subroutines take parameters derived from the configuration mechanisms. Only three global variables are used; the rest are passed directly to subroutines and returns read back.

11.3 Modules

Popular modules are used where possible, and only modules that are shipped with popular Linux distributions. The installer uses a number of modules; the code deployed on nodes only uses File::Find (shipped as default with most distributions) and the core Perl shipped as default by most Linux distributions.

11.4 Error handling

Under review.

11.5 Contribute

Please see http://www.diaser.org.uk/contribute.html. All contributions are received under MIT/X licence terms.

12 Online resources

12.1 Website

http://www.interlinux.org.uk

12.2 SourceForge

http://sourceforge.net/projects/diaser

12.3 Mailing list

https://lists.sourceforge.net/lists/listinfo/diaser-devel

12.4 DIAP/LTASP and early project memory

http://sourceforge.net/projects/diap/

http://interlinux.org.uk/?q=node/5

13 Developers

14 A Personal Installation Scenario - real world example

I have chosen three locations to use for backup nodes. Each node is a virtual machine (VM) residing on a host. Each node has OpenSSH running and accessible by root. Appropriate firewalls are in place and certificates are used for EC2. DIASER is installed on all nodes with yum -y install diaser, which brings in the application and, if required, its dependencies. Each node runs only minimal services, including NTP and SSH, in runlevel 3.

Things to note:
To create the actual storage volumes - pre-packing:

- Archivers, different Linux compression utilities. Compression strength vs. time and other variables
- Pre-use of encryption, like the simple gpg -c utility to encrypt a storage volume with a symmetric cipher using a passphrase


High level:

- Cloud spanning, which is what I am doing here
- Disk staging policies, like locks on a canal
- More useful tables (I use these at work)
- The Fedora spin has a fast, cutting edge release cycle, so it is easy to break dependencies

1) Laptop VM (f16)

Host OS: Fedora 14
Hypervisor: VirtualBox 4.1.4
Network visibility: all nodes ssh

2) Shed VM (f16)

Host OS: Debian 5.0
Hypervisor: VirtualBox 3.2.10
Network visibility: all nodes ssh

3) EC2 VM (f16)

Host OS: Amazon EC2
Hypervisor: paravirtual virtualization

Network visibility: self

Config file generated using this command (For a template I used diaser/docs/diaser.conf.sample - ignore uninitialised value errors):
]$./diaser ~/diaser-conf010412 --configure
]$mv diaser.conf ~/diaser-conf010412

NUM_YEARS 20
START_YEAR 2011
HOUR1 11
HOUR2 6
NODE_A 192.168.6.142
NODE_B 192.168.6.143
NODE_C 174.35.106.89
PORT_A 45
PORT_B 45
PORT_C 45
DRY_RUN 0
LOW_MAX_BW 12800
TZONE_A 0
TZONE_B 0
TZONE_C 0
USER_ACC diaserinterlinux
TOUT 10900
DIR_A /home/
DIR_B /home/
DIR_C /home/
USE_SUDO 0
SUDO_ACCOUNT_NAME_A usera
SUDO_ACCOUNT_NAME_B userb
SUDO_ACCOUNT_NAME_C userc
FILL_START_TIME 10
VOLUME_DIR /mnt/pickup/
DIFF_CONST_PREFIX daily
COLLECT_FULL 1
COLLECT_FULL_DAY 20
FULL_PREFIX monthly

As root or using sudo, make a directory called /mnt/pickup/ on nodea, then ]$chown diaserinterlinux.users /mnt/pickup

Installed DIASER from laptop host OS, as user, using this config (obscured IP and ports) ]$./diaser ~/diaser-conf010412 --install

At the end of the install, which might take a few minutes, follow the instructions to log in to each node from each other, if required. In this example, as user diaserinterlinux: nodea->nodeb, nodea->nodec, nodeb->nodea, nodeb->nodec.

I use a regular scp (to become a cron job later) from my laptop OS to copy backup files for archiving in my personal cloud storage system.

e.g. scp file_to_be_archived.tar.gz diaserinterlinux@192.168.6.142:/mnt/pickup/.

Usage statistics and further advice to follow ...

Happy archiving in the cloud!

APPENDIX

A Tables and calculations

    Bandwidth and capacity lookup table
    ===================================
    Values in GB (decimal)

    BW        Hours
    Mbit/s    1      2      3      4      5      6
    1         0.45   0.9    1.35   1.8    2.25   2.7
    10        4.5    9      13.5   18     22.5   27
    100       45     90     135    180    225    270
    1000      450    900    1350   1800   2250   2700

    Disk space lookup table
    =======================
    BW        Month    1xYr     2xYr
    Mbit/s
    1         324GB    3.9TB
    10
    100
    1000

    Pre-packaging processing times
    ==============================
    Time
    Tool
    Size Mbits/s
    Data Type
    CPU type

For more calculation information please use the --bandwidth tool.
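
The figures in the lookup tables follow from one conversion: 1 Mbit per second is 0.125 MB per second, which is 450 MB (0.45 GB, decimal) per hour. The sketch below reproduces a few table cells in whole megabytes. The function name capacity_mb is hypothetical, and the 30 day month is an assumption inferred from the 324GB figure.

```shell
# capacity_mb MBIT_PER_S HOURS -> decimal megabytes transferable
# 1 Mbit/s = 0.125 MB/s, and 0.125 MB/s * 3600 s = 450 MB per hour.
capacity_mb() {
    echo $(( $1 * $2 * 450 ))
}

capacity_mb 1 1                 # prints 450     (0.45 GB)
capacity_mb 100 6               # prints 270000  (270 GB)
capacity_mb 1 $(( 24 * 30 ))    # prints 324000  (324 GB in a 30 day month)
```

These are theoretical maxima for a fully used link; real throughput, as measured with --bandwidth, will be lower.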

Include more calculation examples.

B Glossary of terms

Under review

C Appliances

DIASER-appliance-3node-OVF-test-pak

NB: Re-build required - please check this section later ...

Download Appliance

Getting started:
----------------
Welcome to this 3 node pre-configured DIASER appliance, test pack.

Unzip and import the three appliances into your virtual machine hypervisor. The network is internal only. Images were created using the freely available, cross-platform, VirtualBox. You can also test DIASER whilst using Windows.

Things to try:
--------------
Test data is read from /mnt/backup on nodeA and generated by a cron job, then distributed. You can view logs and other activity by running #diaser diaser.conf --logs from nodeA (logged in as diaser-user with password diaser-user.) Use diasertest when the node password is requested. Also run $man diaser for more options. Explore the working accounts too.

Leave the system running for a few days and watch the test data inside DIASER using --list.

Pack contents:
--------------
3 x OVF images; based on Ubuntu 32bit 10.04.1 LTS
diaser-appliance-nodeA
diaser-appliance-nodeB
diaser-appliance-nodeC
diaser.conf - node construction is based on this config file
appliance_instructions.txt
manual.pdf
--list screenshot


General node specs:
-------------------
256MiB Ram (PAE CPU mode)
Up to 2TB dynamically expanding disk
Internal network intnet
Hostname - diaser
DIASER working account/pass - diasertest/diasertest


Node specific:
--------------
A) IP 10.20.0.1
DIASER user account/pass, diaser-user/diaser-user


B) IP 10.20.0.2

C) IP 10.20.0.3

Security precautions:
---------------------
This is a test pack. If you do decide to put the appliance into a production environment, you must change all user account passwords.

NB: The nodeA Perl build has not been performance tuned.

E Useful Commands and Tools for Archiving

Finding files of a certain size
find . -size +10G -exec ls -ltrh '{}' \;

This find command searches recursively from the directory it is run from (.), looks for files greater than 10 GiB in size (-size +10G), then runs ls on each one (-exec ls -ltrh). The ls options are l=long listing, t=sort by modification time, r=reverse output, h=human readable.

Exclude directory from search
find . -type f -name "*conditions*" ! -path "./archives/*"

Search for files matching the -name search string, excluding anything under the directory ./archives/ in the current directory.

SSHFS - secure shell file system. This tool can be used to operate, view and manipulate, file systems on remote node servers. Login as a diaser user to view the account and browse archives and even retrieve if necessary.

F Fedora Project

Fedora 17 - 20 users can install DIASER via the standard update mechanisms, use yum install diaser
