Disaster Recovery
VirtFusion 2.3+ supports DR (backup/restore) to minimize downtime and data loss in the event of a disaster.
Data can either be stored in S3 compatible object storage or on a local disk partition.
For a list of S3 compatible providers, see the S3 Compatible Storage Providers section at the end of this page.
This may look complicated, but it really isn't. VirtFusion will try its very best to produce consistent backups that can be restored easily. You just need to configure it on the hypervisor the way you want.
Although disaster recovery is supported on hypervisors that utilize shared storage, it will only create backups of locally stored disks and configurations. All shared storage (e.g. Ceph) disks will be ignored.
It's expected that later versions of VirtFusion will incorporate a GUI component for disaster recovery.
Hypervisor Preparation
S3
You only need to follow this step if you want to use object storage.
Installing the AWS CLI tools
To install the CLI tool, unzip is required on the hypervisor. Install it using your distro package manager, e.g. apt install unzip or dnf install unzip.
Once you have unzip, install the tool.
cd /tmp
curl https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o awscliv2.zip
unzip awscliv2.zip
./aws/install
rm -rf /tmp/aws
rm -f /tmp/awscliv2.zip
The official instructions can be found in the AWS CLI installation documentation.
Configuring the AWS config profile
Create the file /root/.aws/config.
mkdir -p /root/.aws
nano -w /root/.aws/config
In that file you can configure your access credentials for your S3 compatible service.
[profile virtfusion]
aws_access_key_id=ACCESS_KEY
aws_secret_access_key=SECRET_ACCESS_KEY
retry_mode = standard
max_attempts = 3
s3 =
  #max_concurrent_requests = 20
  #max_bandwidth = 5MB/s
  addressing_style = path
You should leave the profile name as virtfusion.
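To confirm the profile and credentials work before the first backup, you can list the bucket with the AWS CLI. The bucket name and endpoint below are the placeholder values used later in this guide, so substitute your own; this is a sketch only, since it needs network access and valid keys:

```shell
# List the backup bucket using the virtfusion profile.
# "backups" and the Wasabi endpoint are example values from this guide.
aws s3 ls s3://backups --profile virtfusion \
  --endpoint-url https://s3.eu-west-1.wasabisys.com
```

If the credentials or endpoint are wrong, this fails immediately, which is much easier to debug than a failed backup run.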
Local Storage
Nothing extra needs to be configured for local storage.
Configuring VirtFusion
Create the file /home/vf-data/conf/dr.json.
nano -w /home/vf-data/conf/dr.json
In that file you can configure your S3 bucket, endpoint, and region, and/or your local storage.
{
  "type": "S3",
  "storeDirType": "once",
  "tmpDir": "home/vf-data/tmp",
  "threads": 1,
  "S3": {
    "endpoint": "https://s3.eu-west-1.wasabisys.com",
    "region": "eu-west-1",
    "bucket": "backups",
    "compress": true,
    "compressLevel": 1,
    "prune": {
      "enabled": true,
      "hoursGrace": 168,
      "keepLast": 30
    }
  },
  "local": {
    "path": "backups",
    "compress": true,
    "compressLevel": 1,
    "prune": {
      "enabled": true,
      "hoursGrace": 168,
      "keepLast": 30
    }
  }
}
type should be either S3 or local.
storeDirType may be one of the following:
Type | Example Path | Description |
---|---|---|
once | virtfusion_dr/[HV_ID]/once | The once type will create a static directory named once, meaning each time a backup runs, the files will be overwritten. |
Y-M-D-H | virtfusion_dr/[HV_ID]/y_m_d_h/[2023_06_01_04] | Typically this will create a new directory every hour. |
Y-M-D | virtfusion_dr/[HV_ID]/y_m_d/[2023_06_01] | This will create a new directory each day. |
Y-M | virtfusion_dr/[HV_ID]/y_m/[2023_06] | This will create a new directory each month. |
Y-W | virtfusion_dr/[HV_ID]/y_w/[2023_03] | This will create a directory weekly (Weeks starting on Monday). |
DW | virtfusion_dr/[HV_ID]/dw/[1-7] | This will create a directory for the day number of the week (1 for Monday, 7 for Sunday). |
All date style types are based on UTC.
- tmpDir is where temporary snapshots will be stored. You don't need to include a / at the start or end.
- threads is how many backups can run at any one time. This option is currently ignored; only 1 thread will be used until a future version.
- S3.endpoint should be the URL of your S3 service.
- S3.region is the S3 region of your S3 service.
- S3.bucket should be the name of the bucket you created to store the backups.
- S3.compress will enable lz4 compression on server disks. Enabling compression will usually slow down the backup process, especially on a fully loaded system, but saves storage space. It's advised to keep this option enabled.
- S3.compressLevel (1 to 16): 1 = fast (default), 16 = slow. Be warned, 16 gives better compression but will be very, very slow! It's advised to keep this option set to 1.
- S3.prune.enabled can be true or false. This option will enable or disable backup pruning.
- S3.prune.hoursGrace is only used for the once storeDirType setting and will wait the specified number of hours before deleting files that no longer belong to the hypervisor.
- S3.prune.keepLast is used for all date style storeDirType settings and specifies how many of the newest folders should be kept on storage.
- local.path should be the local path where you want to store the backups. You don't need to include a / at the start or end.
- local.compress, local.compressLevel and local.prune.* behave exactly like their S3 counterparts above, but apply to local storage.
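As a worked example, here is a minimal local-only dr.json, written to a scratch path for illustration (the real file lives at /home/vf-data/conf/dr.json), together with a quick syntax check. The check assumes python3 is available on the hypervisor:

```shell
# Write a minimal local-only dr.json to a scratch location and
# verify it parses as valid JSON before putting it live.
mkdir -p /tmp/vf-demo
cat > /tmp/vf-demo/dr.json <<'EOF'
{
  "type": "local",
  "storeDirType": "Y-M-D",
  "tmpDir": "home/vf-data/tmp",
  "threads": 1,
  "local": {
    "path": "backups",
    "compress": true,
    "compressLevel": 1,
    "prune": { "enabled": true, "hoursGrace": 168, "keepLast": 30 }
  }
}
EOF
python3 -m json.tool /tmp/vf-demo/dr.json > /dev/null && echo "dr.json OK"
```

A malformed file (a trailing comma, a missing quote) will make json.tool print the offending line instead, which is far easier to spot than a failed backup.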
It's probably wise to back up both /root/.aws/config and /home/vf-data/conf/dr.json for easy access in case of a fatal hypervisor failure.
Creating Backups
- If you run a backup manually, it's highly advised to run the command within a screen session.
- Even though signals are caught and the DR routine will attempt to exit cleanly, resist the urge to ctrl+c when a backup is running. Due to the nature of external snapshots, this may cause inconsistency on disk, which can be hard to repair.
- When a backup runs on a server disk, some server actions (reboot, reinstall, shutdown, etc.) will be blocked at the hypervisor level. Once the backup process has completed on that disk, the actions will be unlocked.
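For example, a manual run inside a screen session (the session name vf-dr is just an example) would look like this:

```shell
# Run the backup inside a named screen session so an SSH disconnect
# doesn't kill it. Detach with Ctrl+a d; reattach with: screen -r vf-dr
screen -S vf-dr /usr/bin/vfcli-hv dr:backup
```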
Generate a list of files that will be included in a backup (with filesize estimates)
vfcli-hv dr:backup --estimate-only
It will generate something similar to this.
+---------------------------------------------------------------+---------+-----------+
| Item | Type | Size |
+---------------------------------------------------------------+---------+-----------+
| /home/vf-data/disk/37fc81d8-402c-41a1-8f7b-5f421c7e2d20_1.img | file | 3.81 GB |
| /home/vf-data/disk/d33ab991-ec0e-491b-b38a-0999f01bbfc9_1.img | file | 6.76 GB |
| /home/vf-data/disk/6ba9b79c-0168-4988-ba6e-b7674c7c2c08_1.img | file | 1.8 GB |
| /home/vf-data/disk/6ba9b79c-0168-4988-ba6e-b7674c7c2c08_2.img | file | 6.69 MB |
| /home/vf-data/disk/6ba9b79c-0168-4988-ba6e-b7674c7c2c08_3.img | file | 6.69 MB |
| /home/vf-data/server | dir | 1.11 MB |
| /home/vf-data/stats | dir | 109.85 MB |
| /home/vf-data/events | dir | 0 B |
| /home/vf-data/conf | dir | 802 B |
| /opt/virtfusion/app/hypervisor/storage/stats | dir | 3.04 KB |
| /opt/virtfusion/app/hypervisor/storage/logs | dir | 314.9 KB |
| /opt/virtfusion/app/hypervisor/storage/nat | dir | 136 B |
| /opt/virtfusion/app/hypervisor/conf/auth.json | file | 339 B |
| /etc/haproxy/haproxy.cfg | file | 1.35 KB |
| /opt/virtfusion/app/hypervisor/database/database.sqlite | file | 12 KB |
| /opt/virtfusion/app/hypervisor/database/queue.sqlite | file | 16 KB |
|---------------------------------------------------------------|---------|-----------|
| Total (approx) | 12.49 GB |
+---------------------------------------------------------------+---------+-----------+
Backup everything
vfcli-hv dr:backup
Backup only specific servers
vfcli-hv dr:backup --only-servers=1754,1253,1002
Backup everything except disks that already exist on the backup storage
vfcli-hv dr:backup --only-missing-disks
Backup everything but exclude specific servers
vfcli-hv dr:backup --exclude-servers=1754,1253,1002
--exclude-servers= should be a comma separated list of server IDs.
Backup everything except the system data (only server disks)
vfcli-hv dr:backup --exclude-system-data
Automating Backups
Obviously you will want to automate your backups. You can do this using a cronjob or a systemd timer.
Cronjob
Create a file in /etc/cron.d/
called virtfusion_dr
and use any of the following, or add your own.
Run daily at 2am
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 2 * * * root /usr/bin/vfcli-hv dr:backup >/dev/null 2>&1
Run every Friday at 2am
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 2 * * FRI root /usr/bin/vfcli-hv dr:backup >/dev/null 2>&1
Run on the first day of the month at 2am
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 2 1 * * root /usr/bin/vfcli-hv dr:backup >/dev/null 2>&1
Run every weekday (Monday through Friday) at 2am
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 2 * * 1-5 root /usr/bin/vfcli-hv dr:backup >/dev/null 2>&1
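The cron examples above cover the cronjob route. For the systemd timer alternative mentioned earlier, a minimal pair of units might look like this (the unit names are my own choice, not something VirtFusion ships):

```
# /etc/systemd/system/virtfusion-dr.service
[Unit]
Description=VirtFusion DR backup

[Service]
Type=oneshot
ExecStart=/usr/bin/vfcli-hv dr:backup

# /etc/systemd/system/virtfusion-dr.timer
[Unit]
Description=Run VirtFusion DR backup daily at 2am

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl daemon-reload && systemctl enable --now virtfusion-dr.timer. Persistent=true will run a missed backup at the next boot if the hypervisor happened to be down at 2am, which plain cron won't do.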
Restoring Backups
SOURCE_PATH
should be the path after the hypervisor ID. For example, if you use Y-M-D for the storeDirType, the path would be y_m_d/2023_06_01 (swap in the date you want); if you use once, it would simply be once.
Restoring system files
vfcli-hv dr:restore --mode=system --source=once --system-opts=haproxy,eventhooks,config
--system-opts= should be a comma separated string of any of the following options.
Option | Description |
---|---|
haproxy | Restore the HAProxy configuration files. |
eventhooks | Restore all custom event hooks. |
config | Restore all system configuration files and logs. |
Restoring all servers
vfcli-hv dr:restore --mode=server --servers=all --source=[SOURCE_PATH] --power-control=true
Restoring specific servers
You may restore specific servers based on their ID using the --servers=
argument.
vfcli-hv dr:restore --mode=server --servers=1,5,7,45,193 --source=[SOURCE_PATH] --power-control=true
You can also use the following command to fetch a pre-populated --servers
argument.
vfcli-hv dr:restore --mode=servers-arg
Which would output something like the following.
--servers=1387,1390,1393,1465,1698,1794
Restoring after a disaster (Replacement hypervisor)
If the worst does happen, you can use DR to restore your backups onto the replacement hypervisor.
You should NOT remove the failed hypervisor from VirtFusion. It MUST remain, as this is where your virtual servers are linked.
Hypervisor preparation
You've now got to the stage where you have a new hypervisor, and it has the base OS installed.
The next step is to run the VirtFusion installer as you normally would when installing a new hypervisor. This also includes setting up your network.
Configuring access to your backups
Once you have VirtFusion installed and a working bridge (or whatever you use for networking), you need to link the new hypervisor to your control server. This can be done using a special DR mode called get-auth.
Before you can run this mode, you need to set up access to your backup storage by following the Hypervisor Preparation steps above.
Re-connecting the hypervisor to the control server
You should now have working access to your storage. You can run the following command to download the auth.json file that is used to connect the control server to the hypervisor.
vfcli-hv dr:restore --mode=get-auth --source=[SOURCE_PATH]
SOURCE_PATH
must include the hypervisor ID as listed in VirtFusion. For example: --source=26/y_m_d/2023_06_06
If all goes well, you should see something like the following.
root@test:~# vfcli-hv dr:restore --mode=get-auth --source=26/y_m_d/2023_06_06
S3 connection check success!
Found backup storage source: virtfusion_dr/26/y_m_d/2023_06_06
Are you sure you would like to continue? (yes/no) [no]:
> y
download: s3://backup-vf/virtfusion_dr/26/y_m_d/2023_06_06/system/auth.json to ../opt/virtfusion/app/hypervisor/conf/auth.json
Downloaded successfully
root@test:~#
The hypervisor should now be linked again and the test connection button in VirtFusion should report success.
Worst case, you can manually download the auth.json and save it as /opt/virtfusion/app/hypervisor/conf/auth.json.
Restoring the system files
The following command will restore the VirtFusion system files.
vfcli-hv dr:restore --mode=system --system-opts=all --source=[SOURCE_PATH]
SOURCE_PATH
should NOT include the hypervisor ID. Now that the control server is linked, it already knows the ID. For example: --source=y_m_d/2023_06_06
Restoring the servers
The following command will restore all servers and configure them.
vfcli-hv dr:restore --mode=server --servers=all --power-control=true --source=[SOURCE_PATH]
That should be it. The --power-control=true
option should have taken care of booting all the servers (or keeping them
turned off if suspended) and they should now be online.
Pruning Backups
You can remove any expired backups using the following command. It uses the prune settings configured in dr.json (see Configuring VirtFusion above).
You may also run this command as a cronjob.
vfcli-hv dr:prune
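For example, a cron entry (same style as the backup jobs above) that prunes daily at 3am, after the 2am backup window, might look like this; the 3am timing is my own suggestion, so adjust it to follow your backup schedule:

```
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
0 3 * * * root /usr/bin/vfcli-hv dr:prune >/dev/null 2>&1
```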
Known Issues
Issue | Description | Action/Fix/Workaround |
---|---|---|
CentOS/CloudLinux system freeze | The QEMU guest agent freezes the file systems so that no changes are made during the backup, but it does not handle loop* devices in the correct freezing order, which leads to a hung task and a kernel crash. This is a bug in the guest agent, not VirtFusion. | Disable qemu-guest-agent |
S3 Compatible Storage Providers
A non-exhaustive list of providers.
Provider | America | Europe (EMEA) | Asia (APAC) |
---|---|---|---|
iDrive e2 | Oregon, Los Angeles, Virginia, Chicago, Miami, Dallas, San Jose, Phoenix, Montreal (Canada) | Ireland, London, Madrid, Paris, Frankfurt | Singapore |
Wasabi | Oregon, Virginia, Plano (TX), Toronto (Canada) | London, Paris, Amsterdam, Frankfurt | Tokyo, Osaka, Sydney, Singapore |
Cloudflare R2 | Western North America, Eastern North America | Western Europe, Eastern Europe | Asia-Pacific |
Backblaze B2 | US East, US West | EU Central | - |
Contabo | United States | Germany | Singapore |
Scaleway | - | Amsterdam, Paris, Warsaw | - |
DigitalOcean | New York City, San Francisco | Amsterdam, Frankfurt | Singapore, Sydney |
Huawei OBS | Mexico City, Santiago, Sao Paulo | Johannesburg, Istanbul | Bangkok, Singapore, Shanghai, Beijing, Guangzhou, Hong Kong |
Vultr | New Jersey, Silicon Valley | Amsterdam | Delhi, Bangalore, Singapore |
OVH | - | Gravelines, Strasbourg, Frankfurt, Beauharnois, Roubaix, Warsaw, London | - |
IONOS | - | Frankfurt, Berlin, Logrono (Spain) | - |
Alibaba OSS | Silicon Valley, Virginia | Frankfurt, London, Dubai | Hangzhou, Shanghai, Nanjing, Qingdao, Beijing, Zhangjiakou, Hohhot, Ulanqab, Shenzhen, Heyuan, Guangzhou, Chengdu, Hong Kong, Tokyo, Seoul, Singapore, Sydney, Kuala Lumpur, Jakarta, Manila, Bangkok, Mumbai |
DreamHost DreamObjects | US East | - | - |
Amazon S3 | Ohio, N. Virginia, N. California, Oregon, Canada, São Paulo | Cape Town, Frankfurt, Ireland, London, Milan, Paris, Stockholm, Spain, Zurich, Bahrain, UAE | Hong Kong, Hyderabad, Jakarta, Melbourne, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo, Beijing, Ningxia |
You can also host your own with solutions like MinIO.