Making a Backup and Disaster Recovery system robust, secure and ready for new challenges.
Steve Gargano, Lead Engineer at Replibit, came to us looking for expertise in Python backends and web user interfaces.
Replibit is a backup & recovery system based on Linux, ZFS (backups), KVM (virtualization) and Python & Angular (backend and web interface). Replibit creates backups from Windows systems and stores these backups in both on-site and off-site servers so they can be recovered at any time.
The system was functional when we were hired. For strategic reasons, Andrew Bensinger, CEO of Replibit, decided to find a new Python company to take over support and development of the Replibit solution.
Replibit’s architecture is clever: a small program called the agent is installed on every Windows computer that needs backup. The agent generates incremental backup images of the Windows system using Shadow Copy. Backup files are transferred immediately over a fast local network connection to an Ubuntu server called the Appliance. The Appliance offers a web interface where system administrators can manage backups, mount them as devices and even boot a virtualized Windows system from any of the backups.
Additionally, Appliances transfer backups to an off-site Ubuntu server, called the Vault, so that backups remain available even after a complete disaster at the customer location. While it’s perfectly feasible to transfer backup increments from an Appliance to a Vault over the internet, it’s often impractical to transfer the initial backup files that way, as they can be extremely large. Replibit’s Vaults therefore offer the ability to import the first batch of backups from an Appliance via a flash drive.
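To make the size problem concrete, here is a rough back-of-the-envelope sketch. The figures (a 4 TB initial image, a 10 GB daily increment and a 20 Mbit/s uplink) are illustrative assumptions, not numbers from Replibit, and the helper `transfer_days` is hypothetical.

```python
def transfer_days(size_bytes: float, link_mbit_per_s: float) -> float:
    """Estimate how many days it takes to send size_bytes over a link.

    Ignores protocol overhead, compression and retries, so real
    transfers would usually take longer.
    """
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000)
    return seconds / 86_400  # seconds per day


# Assumed, illustrative figures (not taken from the project).
initial_days = transfer_days(4e12, 20)          # 4 TB initial backup
increment_hours = transfer_days(10e9, 20) * 24  # 10 GB daily increment

print(f"initial backup: about {initial_days:.1f} days")
print(f"daily increment: about {increment_hours:.1f} hours")
```

Under these assumptions the initial backup would occupy the uplink for weeks, while a daily increment fits comfortably in a night, which is why shipping the first batch on physical media is attractive.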
After that, we had to eliminate the security issues and make the update and installation process robust. Replibit software updates itself, so a bug in the update code could leave thousands of servers unable to update and create a major problem for support staff and system administrators.
But what really helped Replibit move on to the next phase was overcoming the 2TB limitation on Windows systems. Modern Windows systems use GPT instead of MBR, and MBR can address only 2TB of disk space. Replibit’s agent, written in C++, could handle backups of any size, but the virtualization technique used by Replibit could not boot disks larger than 2TB. When a backup file was mounted, FUSE would run code that created the MBR the virtualized system needed in order to boot. Since there was no equivalent code for GPT, booting these backups was impossible. This type of feature is extremely difficult to test, since it requires booting a virtual machine many times and reviewing code byte by byte, with no documentation or help from any other party.
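The 2TB figure follows from the MBR format itself: each partition entry stores its start and length as 32-bit sector counts, and with the traditional 512-byte sector that caps the addressable size at 2 TiB. The short sketch below works through the arithmetic; the function names are hypothetical and are not part of Replibit’s code.

```python
SECTOR_SIZE = 512   # bytes; traditional logical sector size
MBR_LBA_BITS = 32   # MBR partition entries use 32-bit sector counts


def mbr_max_bytes(sector_size: int = SECTOR_SIZE) -> int:
    """Largest size addressable through a 32-bit MBR sector count."""
    return (2 ** MBR_LBA_BITS) * sector_size


def needs_gpt(disk_bytes: int, sector_size: int = SECTOR_SIZE) -> bool:
    """True when a disk is too large to be fully described by an MBR."""
    return disk_bytes > mbr_max_bytes(sector_size)


limit = mbr_max_bytes()
print(limit)                  # 2199023255552 bytes
print(limit / 2 ** 40)        # 2.0 (TiB)
print(needs_gpt(3 * 10**12))  # True: a 3 TB disk exceeds the MBR limit
```

Any disk above that limit needs a GPT, which is why a mount path that could only synthesize an MBR could not present larger backups as bootable drives.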
Some tasks we also performed:
- Continued development of a FUSE file system for BDR (Backup Disaster Recovery) that automatically transforms non-bootable device snapshots into exact drive images of the original machine (C++).
- Worked on the frontend and the backend of the application based on Flask, Twisted, Nginx and Python.
- Provided the highest level of support to customers, assessing data loss and data recovery options.
- Reviewed the system’s code and assessed the application’s security in order to resolve any security issues.
- Oversaw ZFS issues, data loss, and data migration.
- Created a UEFI Agent Service to allow virtualization on new systems.
When Replibit reached out to us, they had no development process and the source code was not secure. There was no code repository, no way of tracking changes and nobody to ask. By the end of the project, Replibit had a code repository with correctly tagged releases and a sturdy development process designed to match their needs for new features and bug fixes. Their update process had become robust and functional, and many critical features and fixes were in place, allowing them to be acquired by eFolder and become a trusted solution for system administrators.
- Amazon Web Services