How to configure ElasticSearch snapshots

These are a few notes on how to back up and restore ElasticSearch indices. It worked for me, but I’m not an ES expert by any means, so if there’s a better way or something horribly wrong, let me know!

Mount the Shared Storage

The most basic form of snapshots, without using plugins for S3 or other distributed filesystems, uses a shared filesystem mounted on all nodes of the cluster. In my specific case this was an NFS share.

# yum -y install nfs-utils rpcbind
# systemctl enable rpcbind
# vim /etc/fstab
... add your mountpoint ...
# mkdir /nfs
# mount /nfs
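
Just for reference, an NFS entry in /etc/fstab looks something like this (the server name and export path here are made up, use your own):

# hypothetical example, replace the server and export path with your own
nfs-server.example.com:/export/es-backup  /nfs  nfs  defaults  0 0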

Set the Shared Storage as a repository for ES

After the mount point is configured, you need to register it as a repository path in the ES configuration: edit /etc/elasticsearch/elasticsearch.yml and add the key:

path.repo: /nfs

You’ll need to restart each node after the change and wait for it to rejoin the cluster.
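
On systemd-based installs that boils down to something like this on each node, one at a time, waiting for the cluster to go green before moving on (a sketch; the service name may differ on your setup):

# systemctl restart elasticsearch
# curl -sS "http://localhost:9200/_cluster/health?pretty"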

Create the snapshot repository in ES

Once the configuration is in place you should be able to create your snapshot repository. I made this script based on the official documentation:

#!/bin/bash

repo_name="backup"
repo_location="backup-weekly"

/usr/bin/curl -XPUT "http://localhost:9200/_snapshot/${repo_name}?pretty" -H 'Content-Type: application/json' -d"
{
  \"type\": \"fs\",
  \"settings\": {
    \"location\": \"${repo_location}\"
  }
}
"

This saves the snapshots in /nfs/backup-weekly/, and the repository name is backup.
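
You can double-check that the repository was registered with a GET on the same endpoint:

# curl -sS "http://localhost:9200/_snapshot/backup?pretty"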

Create your first snapshot

Now you should be able to create your first snapshot. I created another script that takes one snapshot per day, named after the weekday. Please note: make sure the path to date is correct, or the command substitution will fail, snapshot_name will be empty, and the DELETE will hit the repository instead of the snapshot!

#!/bin/bash

repo_name="backup"
snapshot_name=$(LC_ALL=C /usr/bin/date +%A|tr '[:upper:]' '[:lower:]')

target="vip-es"

# delete the old snapshot (if any)
echo $(date) DELETE the old snapshot: $snapshot_name >> /var/log/es-backup.log
/usr/bin/curl -XDELETE "http://${target}:9200/_snapshot/${repo_name}/${snapshot_name}" >> /var/log/es-backup.log

echo $(date) CREATE the new snapshot: $snapshot_name >> /var/log/es-backup.log
/usr/bin/curl -XPUT "http://${target}:9200/_snapshot/${repo_name}/${snapshot_name}?wait_for_completion=true&pretty" >> /var/log/es-backup.log

The output should be something like:

ven 9 feb 2018, 01.31.01, CET DELETE the old snapshot: friday
{"error":{"root_cause":[{"type":"snapshot_missing_exception","reason":"[backup:friday] is missing"}],"type":"snapshot_missing_exception","reason":"[backup:friday] is missing"},"status":404}
ven 9 feb 2018, 01.31.01, CET CREATE the new snapshot: friday
{
  "snapshot" : {
    "snapshot" : "friday",
    "uuid" : "12345679-20212223",
    "version_id" : 6010199,
    "version" : "6.1.1",
    "indices" : [
      "test_configuration",
      ".kibana"
    ],
    "state" : "SUCCESS",
    "start_time" : "2018-02-09T00:31:01.586Z",
    "start_time_in_millis" : 1518136261586,
    "end_time" : "2018-02-09T00:31:04.362Z",
    "end_time_in_millis" : 1518136264362,
    "duration_in_millis" : 2776,
    "failures" : [ ],
    "shards" : {
      "total" : 25,
      "failed" : 0,
      "successful" : 25
    }
  }
}

In this case the DELETE failed because I didn’t have a previous snapshot for the current day.
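
I run the snapshot script from cron every night; an /etc/cron.d-style entry would look something like this (the script path is made up, point it at wherever you saved the script):

# take the daily snapshot at 01:30 every night
30 1 * * * root /usr/local/bin/es-snapshot.sh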

List the available snapshots

To operate on the snapshots I made another script to list them by name, using jq. You’ll need to install it first (on CentOS 7: yum -y --enablerepo=epel install jq).

#!/bin/bash

repo_name="backup"

/usr/bin/curl -sS "http://localhost:9200/_snapshot/${repo_name}/_all" | jq '.snapshots[] | .snapshot,.end_time'

The output is just a list of snapshot names and their timestamps:

# bash list_snapshots.sh
"wednesday"
"2018-02-07T02:36:05.564Z"
"thursday"
"2018-02-08T02:37:10.403Z"
"friday"
"2018-02-09T02:31:04.362Z"

Restore a snapshot

No backup can be considered “good” without testing a restore from it. So I made another script to test how the restore would work on a separate test environment:

#!/bin/bash

repo_name='prod'
snap_name='wednesday'

for index_name in $(/usr/bin/curl -sS http://localhost:9200/_aliases | /usr/bin/jq 'keys | .[]' | sed -s "s/\"//g" ); do
    /usr/bin/curl -XPOST "http://localhost:9200/${index_name}/_close"
done

/usr/bin/curl -XPOST "http://localhost:9200/_snapshot/${repo_name}/${snap_name}/_restore?pretty"

I’m pretty sure there must be a better way to do this: what I’m doing is getting all the current indices and closing them all one by one (because you can’t restore an index that is currently open), then restoring the snapshot I copied over from the other environment.

It’s pretty horrible, but it works. If you know a better way let me know and I’ll change it; if you don’t… well, it works :)
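
One possible shortcut I haven’t tested on this setup: the close index API should accept _all, so every index could be closed in a single call before the restore:

# close every index in one call (check that _all/wildcard operations are allowed on your cluster)
/usr/bin/curl -XPOST "http://localhost:9200/_all/_close"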

Synchronize a directory structure with Ansible

Disclaimer: this is not ideal. We should manage the whole configuration with Ansible. “Baby steps” I guess… :)
Consider this a workaround I hope you’ll never have to resort to, but I’m sharing it just in case…

We’re migrating from some old scripts to using Ansible to handle some of our clients’ deploys.

One of the tasks handled by those bash scripts was synchronizing a directory structure, so that the application would always find the same log directory layout on every application server.

We used rsync for that, copying only the directories:

rsync -av -f"+ */" -f"- *" /path/to/app/ $target:/path/to/app/

To translate this to Ansible we used two tasks:

---
- name: Deploy log directories
  vars:
    dir_log_path: /var/log/nginx
  hosts: webservers
  serial: 10%
  tasks:
  - name: find log directories
    find:
      paths:
      - '{{ dir_log_path }}'
      file_type: directory
    register: log_dirs
    delegate_to: ws-deploy

  - name: create log directories
    file:
      path: "{{ item.path }}"
      state: directory
      owner: "{{ item.uid }}"
      group: "{{ item.gid }}"
      mode: "{{ item.mode }}"
    with_items: "{{ log_dirs.files }}"

We record in the log_dirs variable the directories that exist on ws-deploy, the server where the latest configuration is loaded, and then recreate the same structure on all the other webservers using Ansible’s file module.
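
Running it is the usual ansible-playbook invocation (the playbook and inventory names here are made up):

$ ansible-playbook -i production deploy_log_dirs.yml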

Running NodeJS on CentOS 7 in production: a few pointers

Not ready to write an article about this yet, but I’m gonna share a few links.

Upgrade VMware ESX from the command line

VMware has some of the worst documentation in the entire industry, so I’m saving here these notes for fellow admins that need to deal with this.

First, you have to identify your current profile:

# ssh YOUR_ESX_SERVER
~ # esxcli software profile get
(Updated) HP-ESXi-5.1.0-standard-iso
   Name: HP-ESXi-5.1.0-standard-iso
   Vendor: YOUR_VENDOR
   Creation Time: 2017-11-07T15:24:51
   Modification Time: 2017-11-07T15:25:06
   Stateless Ready: False

With that info you can go on the VMware website (or your vendor website) and download the new release. In my case this is an HP server, so I downloaded VMware-ESXi-5.5.0-Update3-3116895-HP-550.9.4.26-Nov2015-depot.zip from the download page.

Then I loaded the depot file to the ESX server:

# scp VMware-ESXi-*-depot.zip YOUR_ESX_SERVER:/vmfs/volumes/YOUR_VOLUME_NAME/
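
Before the upgrade the host has to go into maintenance mode (after shutting down or migrating its VMs); I did it from the vSphere client, but it should also be doable from the ESXi shell with something along these lines:

~ # esxcli system maintenanceMode set --enable true
~ # esxcli system maintenanceMode get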

At this point I shut down all the VMs on the server and put it in maintenance mode, then logged back in on the console, found the new profile name and ran the upgrade:

# ssh YOUR_ESX_SERVER
~ # esxcli software sources profile list -d /vmfs/volumes/YOUR_VOLUME_NAME/VMware-ESXi-5.5.0-Update3-3116895-HP-550.9.4.26-Nov2015-depot.zip
Name                              Vendor           Acceptance Level
--------------------------------  ---------------  ----------------
HP-ESXi-5.5.0-Update3-550.9.4.26  Hewlett-Packard  PartnerSupported

~ # esxcli software profile update -d /vmfs/volumes/YOUR_VOLUME_NAME/VMware-ESXi-5.5.0-Update3-3116895-HP-550.9.4.26-Nov2015-depot.zip -p HP-ESXi-5.5.0-Update3-550.9.4.26
Update Result
  Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
  Reboot Required: true

I rebooted the system, as required, and then logged back in to check the updates:

~ # esxcli software profile get
(Updated) HP-ESXi-5.1.0-standard-iso
   Name: (Updated) HP-ESXi-5.1.0-standard-iso
   Vendor: YOUR_VENDOR
   Creation Time: 2017-11-07T15:24:51
   Modification Time: 2017-11-07T15:25:06
   Stateless Ready: False
   Description: 

      2017-11-07T15:24:51.436759+00:00: The following VIBs are
      installed:
        net-bnx2x     2.712.50.v55.6-1OEM.550.0.0.1331820
        ata-pata-amd  0.3.10-3vmw.550.0.0.1331820
        sata-sata-sil24       1.1-1vmw.550.0.0.1331820
[...]

Hopefully nobody out there will have to deal with this, but if you do, I hope I got you covered.

How to solve OpenVPN errors after upgrading OpenSSL

I went ahead and upgraded OpenVPN and OpenSSL on an old production system, but after restarting the service the clients would not connect. There were two different problems: a “CRL expired” error and, after fixing that one, a “CRL signature failed” error.

CRL expired

The OpenVPN server logs were reporting:

Mon Nov  6 10:04:22 2017 TCP connection established with [AF_INET]192.168.100.1:19347
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 TLS: Initial packet from [AF_INET]192.168.100.1:19347, sid=150b3618 b004e9a4
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 VERIFY ERROR: depth=0, error=CRL has expired: C=IT, ST=PR, L=Parma, O=domain, OU=domain.eu, CN=user, name=user, emailAddress=info@stardata.it
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 OpenSSL: error:140890B2:SSL routines:SSL3_GET_CLIENT_CERTIFICATE:no certificate returned
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 TLS_ERROR: BIO read tls_read_plaintext error
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 TLS Error: TLS object -> incoming plaintext read error
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 TLS Error: TLS handshake failed
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 Fatal TLS error (check_tls_errors_co), restarting
Mon Nov  6 10:04:23 2017 192.168.100.1:19347 SIGUSR1[soft,tls-error] received, client-instance restarting

This is a common problem on older systems: the culprit is the OpenSSL configuration used to generate the CRL, which limits its validity to just 30 days by default.

So I had to regenerate my CRL after increasing the default_crl_days parameter in the OpenSSL config to 180 (more than enough for our use case), using:

$ openssl  ca  -gencrl  -keyfile keys/ca.key  \
               -cert keys/ca.crt  -out keys/crl.pem \
               -config easy-rsa/openssl-1.0.0.cnf
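
For reference, the parameter lives in the [ CA_default ] section of the easy-rsa OpenSSL config; pick a value that fits your own rotation policy:

[ CA_default ]
# how long a generated CRL stays valid; the 30-day default is what bit us
default_crl_days = 180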

CRL signature failed

Due to the vulnerabilities found in MD5, this hash has been disabled by default in modern OpenSSL builds. Our certificates, though, were still using it, so the new error message (after fixing the CRL) became:

Mon Nov  6 10:14:40 2017 TCP connection established with [AF_INET]192.168.100.1:18463
Mon Nov  6 10:14:41 2017 192.168.100.1:18463 TLS: Initial packet from [AF_INET]192.168.100.1:18463, sid=13fdd1fe 5d82d4d6
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 VERIFY ERROR: depth=0, error=CRL signature failure: C=IT, ST=PR, L=Parma, O=domain, OU=domain.eu, CN=user, name=user, emailAddress=info@stardata.it
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 OpenSSL: error:140890B2:SSL routines:SSL3_GET_CLIENT_CERTIFICATE:no certificate returned
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 TLS_ERROR: BIO read tls_read_plaintext error
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 TLS Error: TLS object -> incoming plaintext read error
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 TLS Error: TLS handshake failed
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 Fatal TLS error (check_tls_errors_co), restarting
Mon Nov  6 10:14:42 2017 192.168.100.1:18463 SIGUSR1[soft,tls-error] received, client-instance restarting

This one was trickier to solve. It turns out that you can re-enable MD5 as a workaround using two environment variables: NSS_HASH_ALG_SUPPORT=+MD5 and OPENSSL_ENABLE_MD5_VERIFY=1. In my case I just added them to the OpenVPN init script, because the system is going to be decommissioned soon.
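
For reference, the workaround amounts to exporting these two variables in the environment OpenVPN starts with, e.g. near the top of the init script:

# workaround only: re-enable MD5 signature verification on a system headed for decommissioning
export NSS_HASH_ALG_SUPPORT=+MD5
export OPENSSL_ENABLE_MD5_VERIFY=1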

Develop a Jekyll website… without Jekyll

Containers are great for developers: when I’m messing around with code I try to keep everything neatly containerized, so I can just pull my repository on some other machine, run a few scripts and be ready to keep on developing without having to install stuff on the main Operating System.

Jekyll is a nice static site generator, used prominently on GitHub. An already-made Jekyll container exists, but I couldn’t find out how (or even if) you could use it to create a Jekyll website from scratch. So I fired up a generic Ruby container, installed Jekyll in it to create the base layout, and then ran the already-made Jekyll container to build the website.

$ cat > Gemfile <<EOF
source 'https://rubygems.org'
gem "jekyll"
EOF

$ docker run --rm --volume=$PWD:/usr/src/app -w /usr/src/app -it ruby:latest /bin/bash
[container#1]# bundle install
[...]
Fetching jekyll 3.6.0
Installing jekyll 3.6.0
Bundle complete! 1 Gemfile dependency, 20 gems now installed.
Bundled gems are installed into /usr/local/bundle.
[container#1]# jekyll new test01
[...]
Bundler: Using jekyll 3.6.0
Bundler: Bundle complete! 1 Gemfile dependency, 20 gems now installed.
Bundler: Bundled gems are installed into /usr/local/bundle.
New jekyll site installed in /usr/src/app/test01.
[container#1]# exit
$ ls -l test01
-rw-r--r-- 1 root root  398 ott 14 16:40 404.html
-rw-r--r-- 1 root root  539 ott 14 16:40 about.md
-rw-r--r-- 1 root root 1,7K ott 14 16:40 _config.yml
-rw-r--r-- 1 root root  937 ott 14 16:40 Gemfile
-rw-r--r-- 1 root root  213 ott 14 16:40 index.md
drwxr-xr-x 2 root root 4,0K ott 14 16:40 _posts

Once I had the basic site structure ready, I ran the Jekyll container to build it:

$ cd test01
$ docker run --rm  --volume=$PWD:/srv/jekyll  -it  jekyll/jekyll:latest  jekyll build
Resolving dependencies...
The Gemfile's dependencies are satisfied
Configuration file: /srv/jekyll/_config.yml
            Source: /srv/jekyll
       Destination: /srv/jekyll/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 0.292 seconds.
 Auto-regeneration: disabled. Use --watch to enable.
$ ls -lh _site/
-rw-r--r-- 1 velenux velenux 5,5K ott 14 16:44 404.html
drwxr-xr-x 2 velenux velenux 4,0K ott 14 16:44 about
drwxr-xr-x 2 velenux velenux 4,0K ott 14 16:44 assets
-rw-r--r-- 1 velenux velenux 3,7K ott 14 16:44 feed.xml
-rw-r--r-- 1 velenux velenux 5,5K ott 14 16:44 index.html
drwxr-xr-x 3 velenux velenux 4,0K ott 14 16:44 jekyll

So… that’s it, you can now develop your Jekyll website without having Jekyll installed on your system.
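
The same container can also serve the site locally while you work on it; something along these lines should do (4000 is Jekyll’s default port, and --host is needed so the server listens outside the container):

$ docker run --rm --volume=$PWD:/srv/jekyll -p 4000:4000 -it jekyll/jekyll:latest jekyll serve --host 0.0.0.0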

Centralize your logs on the cheap on obsolete systems

This is a tale about what you should never do, but are often forced to do in this day and age.

I’ll explain the technical solution and then tell the story for some context.

On a recent-ish system, install multitail:

# yum -y --enablerepo=epel install multitail

multitail lets you follow multiple tails, or even the output of multiple commands, in a single window (or in multiple windows handled by ncurses), and it can also save the output of those commands to another file. In my case the command line looked like:

multitail --mergeall -D -a all.log \
  -l 'ssh web01 "tail -qF /var/log/apache2/*.log /var/log/apache2/*/*.log"' \
  -l 'ssh web02 "tail -qF /var/log/apache2/*.log /var/log/apache2/*/*.log"'

This would create a file all.log containing the output from tail -qF of Apache logs from web01 and web02.

So, what’s the backstory? Why would I do something like this? Centralized logs are nothing new, right? We have Solutions[tm] for that.

Backstory

Imagine you have a time constraint of “one hour”.

Then imagine you have systems so obsolete that the signing key (valid for 10 years) for their repositories expired.

If I had more time I would have checked whether rsyslog was recent enough to have the text file input module, and then tried to have rsyslog push the logs to a more recent system running Logstash/ELK.

Bonus code

I made a little script to generate the multitail command line; here, have fun:

#!/bin/bash
HOST_LIST="web01 web02"
LOG_LIST="/var/log/apache2/*.log /var/log/apache2/*/*.log"

CMD_MULTITAIL="multitail --mergeall -D -a all.log"

for target in $HOST_LIST ; do
  CMD_MULTITAIL="$CMD_MULTITAIL -l 'ssh $target \"tail -qF $LOG_LIST\"'"
done

echo "$CMD_MULTITAIL"
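
You can copy-paste the generated command line, or eval it directly (gen_multitail.sh is just whatever name you saved the script under):

$ eval "$(bash gen_multitail.sh)"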

I seriously hope nobody (else) will ever need this, but if you do, I got you covered.