Replibit

Making a Backup and Disaster Recovery system robust, secure and ready for new challenges.

Background

Steve Gargano, Lead Engineer at Replibit, came to us looking for expertise in Python backends and web user interfaces.

Replibit is a backup and recovery system built on Linux, ZFS (backups), KVM (virtualization), and Python and Angular (backend and web interface). Replibit creates backups of Windows systems and stores them on both on-site and off-site servers so they can be recovered at any time.

The system was already functional when we were hired. For strategic reasons, Andrew Bensinger, CEO of Replibit, decided to find a new Python company to take over support and development of the Replibit solution.

Architecture

Replibit’s architecture is very clever: a small program called the agent is installed on every Windows computer that needs backup. The agent generates incremental backup images of the Windows system using Shadow Copy. Backup files are transferred immediately over a fast local network connection to an Ubuntu server called the Appliance. A Replibit Appliance offers a web interface where system administrators can manage backups, mount them as devices and even boot a virtualized Windows system from any of the backups.

Additionally, Appliances transfer backups to an off-site Ubuntu server, called the “Vault”, so that backups remain available even after a complete disaster at the customer location. While it’s perfectly feasible to transfer backup increments from an Appliance to a Vault over the internet, it’s often impractical to transfer the initial backup files, which can be extremely large. For that reason, Replibit’s Vaults can import an Appliance’s first batch of backups from a flash drive.

The Challenge

Every aspect of taking on this project was a challenge. There were thousands of lines of Python and JavaScript code and no documentation whatsoever about how Replibit worked. We had to start by installing the system locally and monitoring the browser’s network activity to discover the backend endpoints. On the Linux side, we had to find out which process was listening on port 80 and, from there, trace all the way up to the ZFS commands called from Python to see what actually performed the file system operations on the server.
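
To make that last step more concrete, here is a minimal sketch of how a Python backend typically shells out to ZFS. It is illustrative only and not Replibit’s code: the function names and the dataset name in the usage note are hypothetical. It assumes the standard `zfs list` command with `-H` (no header, tab-separated columns) and `-p` (exact numeric values).

```python
import subprocess


def parse_snapshot_list(output):
    """Parse tab-separated `zfs list -H -p -o name,used` output into dicts."""
    snapshots = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, used = line.split("\t")
        # Snapshot names have the form dataset@snapshot.
        dataset, _, snapshot = name.partition("@")
        snapshots.append(
            {"dataset": dataset, "snapshot": snapshot, "used_bytes": int(used)}
        )
    return snapshots


def list_snapshots(dataset):
    """Hypothetical helper: list snapshots under a dataset (needs ZFS installed)."""
    cmd = ["zfs", "list", "-H", "-p", "-r", "-t", "snapshot",
           "-o", "name,used", dataset]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_snapshot_list(result.stdout)
```

On a machine with ZFS, `list_snapshots("tank/backups")` would return one dictionary per snapshot. Keeping the parsing separate from the subprocess call means the parser can be exercised without a ZFS pool.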

Our first task was to add the ability to handle multiple users per account. Updating the HTML and JavaScript was easy enough; the difficult part was finding out how to add users with the right role and grant them access to backups without breaking any existing functionality.

After that, we had to eliminate the security issues and make the installation and update process robust. Replibit’s software updates itself, so a bug introduced there could leave thousands of servers unable to update and create a major problem for support staff and system administrators.
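
One common way to make a self-update step harder to break is to verify the downloaded package before swapping it into place. The sketch below shows that general pattern under stated assumptions; it is not Replibit’s updater. The function names, the file paths and the idea of a published SHA-256 checksum are all hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_update(downloaded_path, expected_sha256, target_path):
    """Install the download only if its checksum matches (hypothetical helper).

    Returns True when the update was installed, False when it was rejected.
    """
    if sha256_of(downloaded_path) != expected_sha256:
        os.remove(downloaded_path)  # discard the corrupt or tampered download
        return False
    # os.replace swaps the file in one step when both paths are on the
    # same filesystem, so readers never see a half-written target.
    os.replace(downloaded_path, target_path)
    return True
```

With this shape, a failed or corrupted download leaves the currently installed version untouched, so the server can still retry the update later.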

Achievements

What really helped Replibit move on to the next phase, though, was overcoming the 2TB limitation on Windows systems. Modern Windows systems use GPT partition tables instead of MBR, and MBR can address only 2TB of disk space. Replibit’s agent, written in C++, could take backups of any size, but the virtualization technique used by Replibit was not prepared to boot disks larger than 2TB. When a backup file was mounted, FUSE would execute code that created the MBR needed by the virtualized system in order to boot. Since there was no equivalent code for GPT, booting these backups was impossible. This type of feature is extremely difficult to test: it meant booting a virtual machine many times and reviewing the code byte by byte, with absolutely no documentation or help from any other party.
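
The 2TB figure follows from the MBR format itself: partition entries store sector addresses as 32-bit values, and with 512-byte sectors that gives 2^32 × 512 bytes, or 2 TiB. The short sketch below works through that arithmetic and shows how the two partition schemes can be told apart from the first bytes of a disk image. It assumes 512-byte logical sectors, and the function names are hypothetical; it is not code from Replibit’s FUSE layer.

```python
SECTOR_SIZE = 512  # bytes; assumes traditional 512-byte logical sectors

# MBR partition entries hold 32-bit sector addresses and counts.
MBR_MAX_BYTES = (2 ** 32) * SECTOR_SIZE  # 2 TiB


def fits_in_mbr(disk_size_bytes):
    """Return True if a disk of this size is fully addressable with MBR."""
    return disk_size_bytes <= MBR_MAX_BYTES


def partition_scheme(first_two_sectors):
    """Classify a disk from its first 1024 bytes: 'gpt', 'mbr' or 'unknown'."""
    if len(first_two_sectors) < 2 * SECTOR_SIZE:
        raise ValueError("need at least the first two 512-byte sectors")
    # A GPT header sits at LBA 1 and starts with the ASCII text "EFI PART".
    if first_two_sectors[SECTOR_SIZE:SECTOR_SIZE + 8] == b"EFI PART":
        return "gpt"
    # An MBR ends sector 0 with the boot signature bytes 0x55 0xAA.
    if first_two_sectors[510:512] == b"\x55\xaa":
        return "mbr"
    return "unknown"
```

GPT is checked first because GPT disks also carry a protective MBR in sector 0, so the 0x55 0xAA signature alone does not identify a plain MBR disk.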

Other tasks we performed:

  • Continued development of a FUSE file system for BDR (Backup and Disaster Recovery) that automatically transforms non-bootable device snapshots into exact drive images of the original machine (C++).
  • Worked on the frontend and backend of the application, based on Flask, Twisted, Nginx and Python.
  • Provided the highest level of support to customers, assessing data loss and data recovery options.
  • Reviewed the system’s code and performed a security assessment of the application to resolve security issues.
  • Reviewed ZFS issues, data loss, and data migration.
  • Created a UEFI Agent Service to allow virtualization on new systems.

When Replibit reached out to us, they had no development process and the source code was not secure. There was no code repository, no way of tracking changes and nobody to ask. By the end of the project, Replibit had a code repository with correctly tagged releases and a development process designed to match their needs for new features and bug fixes. Their update process was robust and functional, and many critical features and fixes were in place, allowing them to be acquired by eFolder and become a strong solution for system administrators.
