Getting bit by Discontinuity

DISCLAIMER from a LAMER: The transition planning process used here is unacceptable outside of total disaster recovery where you don’t care about the outcome. I will provide a new post in the future that more adequately covers the considerations that should be made, many of which I simply didn’t care about for this migration.

***

So, you’ve got a bunch of servers that do all the things for your infrastructure and you’re super proud of yourself for having built them all out.

Congratulations.

Service X performs organizational function Y. You met the objective. Good job.

But, you missed something. Something big.

Unfortunately, since that’s all you focused on while designing and building out your project, it is, whether you know it yet or not, a total failure that will eventually be ripped out entirely. You’ve now put your name on something that’s going to be an expensive ongoing headache for the organization you built it for, for the people who work there, and for you or whoever else has to maintain it. In fact, the only people who are excited about it are your competitors, since you’ve just put out crud that people will hear about, and the red teams targeting your client or organization, who will prey on the security openings created by maintenance failures, which in turn come from the lack of process uniformity that your poor design considerations produced.

If you’re a guilty party, it’s okay. I did it too with my stuff. Let’s talk about continuity.

“What is continuity, though?”, you might ask.

Continuity, from an IT infrastructure architecture perspective, is how closely the services, and the servers themselves, resemble one another in configuration, naming conventions, installed software sets, network routing, and so on. It should also include the processes used in maintenance operations and DevOps operations (if you’re managing a software development lifecycle): how software is installed, how you’re tracking and making changes to servers and service configurations, how problems are repaired and documented, and who makes those changes.

From a software engineering perspective it’s a little more than that. Software engineering environments depend on infrastructure continuity a great deal more and so do business operations environments. In addition to resting on top of sound infrastructure continuity, development environments need to create continuity of their own. Let’s look at it from the software engineering perspective since no business person will ever want to read this and I’m okay with that:

Great. How do I get started?

Now that you have an idea of what continuity is, you should immediately be realizing not only how important it is, but how many environments you’ve seen where it just isn’t there, or where huge discontinuity causes problems. Discontinuity is there if you look for it; every organization has IT continuity issues.

I’ve got just the perfect environment as an example. I’ve got servers that I built out for my own use, organically, as needed per project, and they have absolutely no continuity, not even in the system users. That’s what prompted this article.

So, in my case, since I do most of the development and system administration, continuity is an extra bit of work that was sacrificed to speed up the time it takes to get something done, which is usually the reason you see discontinuity in your own environments. Now that I need it, it’s too late, and it’s time to transition.

We’ll just do the basics.
The users.
Their passwords, uids, and gids.
The services.
The configuration of those services.
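
To see how far apart the hosts really are on the first two items, something like the following can be run on each machine and the outputs compared. This is a minimal sketch under stated assumptions: the function name list_accounts is made up for illustration, and the 1000 cutoff for regular users is assumed from the usual defaults on Ubuntu and Fedora.

```shell
#!/bin/sh
# Sketch: print name:uid:gid for regular (non-system) accounts.
# list_accounts is a hypothetical helper name; the 1000 lower bound and the
# 65534 ("nobody") upper bound are assumptions about the distro defaults.
list_accounts() {
    passwd_file="${1:-/etc/passwd}"
    # passwd fields: name:password:uid:gid:gecos:home:shell
    awk -F: '$3 >= 1000 && $3 < 65534 { print $1 ":" $3 ":" $4 }' "$passwd_file" | sort
}
```

Saving the output of list_accounts to a file per host and diffing the files shows every mismatched name, uid, or gid at a glance.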

Right off the bat I see an issue. We’re going to want some continuity with services and configuration, and that can’t be solved with a syncing operation, so we’ll want to at least have the configuration structures be uniform. There are more complex ways to do this, but in my use case we can just make the paths and configuration directory trees the same and go from there. This highlights the next problem: some of the servers run different Linux distributions, which use different default paths. If I decide to be okay with OS-level discontinuity, this will require me to use different, non-standard paths on some or all of them. In my case that means Fedora and Ubuntu Server, and I believe one, though I can’t recall which, is Arch Linux (which I’d like to decommission from my environment anyway).

While you can spend all night tweaking paths and init scripts to make these all uniform, in my case it’s less work with less reward than simply rebuilding it. So, before I do any of this I will want to map out my environment and plan to reinstall and reconfigure strategically. Which, honestly, is fine to me, since it was designed organically instead of for functionality.

First, we’re creating an environment inventory and we’re calling it…

CURRENT STATE ANALYSIS

Hosts and Services

I’ve got 4 VPSs and a rack server, but these currently serve multiple purposes for many more than 4 domains.

www_irc_more (Arch Linux) (VPS)

an unused, unmaintained irc server that needs destroyed

www_luxi (Ubuntu Wily) (VPS)

parent org product prod environment

www_silo (Fedora 24) (VPS)

parent org landing, wordpress (which I want to move away from)
database, mysql
storage
prod for some showcase products,
gitlab host
znc bouncer

www_surro (Fedora 22) (VPS)

hosting this blog, wordpress (which I don’t want in the pipeline)

oldhorse sandbox (Ubuntu Wily) (On-Prem)

preprod for all pipelines and host for the silonian archives; rackserver

That’s all I can talk about.

So I want to do some rearrangement of which services and features are hosted where. Discontinuity causes differences in process. Differences in process cause errors.

I am tired of checking which distro or server I’m on to figure out where things are, and I’ll want to move some services around. I want maintenance on one machine to be the same as on any other. In my case I can get away with backing up, reinstalling the OSes, and setting each one back up one at a time, since I’ve only got 5 public hosts to worry about.
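
Until everything is on one distro, the check itself can at least be scripted for the inventory. A small sketch, assuming each host ships an /etc/os-release file (systemd-based distros such as Fedora, Ubuntu, and Arch generally do); the function name distro_of is hypothetical.

```shell
#!/bin/sh
# Sketch: print "<id> <version>" read from an os-release style file.
# distro_of is a hypothetical helper name, not an existing tool.
distro_of() {
    os_release="${1:-/etc/os-release}"
    # os-release is a list of shell-compatible KEY=value lines, so it can be
    # sourced; the subshell keeps its variables out of the calling shell.
    (
        . "$os_release" || exit 1
        # Rolling releases may not set VERSION_ID, hence the fallback text.
        printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unversioned}"
    )
}
```

Run over ssh against each host, this gives a one-line answer per machine that can go straight into the inventory.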

Now let’s inventory features and services

The apache services will need to be configured uniquely so I won’t want to use a configuration management tool for the nodes for this part.

I don’t have any desire to run an IRC server.

The luxi prod environment needs thought out a bit for branding reasons (consider business needs too when you’re planning this kind of movement). This should be fine as long as luxi is branded as a silo child and the silo landing page links to luxi.

oldhorse — since I’m going with puppet for configuration management I’m going to use oldhorse as the puppet master. This should give me central control over everything in the event of an emergency.

Nothing for an orchestration layer exists currently. This would be quite nice to have but I’m not too thrilled about current free selections. I may build this out myself later after getting everything controlled by puppet.

FUTURE STATE PLANNING

Service Rearrangement

I will remove the IRC server. This will free up www_irc_more, allowing me to move the gitlab instance there to be shared by all projects.

I will make Ubuntu Server the new infrastructure standard. Ubuntu has, over time, proven to be the absolute best self-managed server distribution out there besides FreeBSD, with greater package selection. Fedora and RHEL have not, and I don’t particularly care for how their support communities are operated. Arch is a security nightmare. Nothing else is really meeting my standards right now.

Services Additions and ISA Tweaks

Now that we have existing services and hosts inventoried, let’s see if we need additional hosts and services and where those would fit in.

Well, if we’re syncing users, passwords, gids, uids, and have shared storage, or want shared storage, we’ll want a central point to administrate them and evaluate the current security model for any impacts. Since I’m lazy, I’m using the ISA that comes with the distros and using OS-level tools for everything, creating five times as much work for myself in the process instead of justifying the build out time. LDAP would be fine for this, but I could probably also get away with using puppet for this and reducing the number of services to maintain while increasing the time I spend maintaining puppet configuration. LDAP it is.

Note: A person I did not get permission to name has accurately pointed out that a single LDAP server will create a point of failure, so I’ll need to decide between acceptable risk, loadbalancing with a slave LDAP server that syncs to a primary, or creating some kind of fallback option.

So where would the LDAP fit? How about www_irc_more which needs renamed now? It’s an unused server since its only purpose was removed from the service inventory. The other servers are either the oldhorse sandbox or production servers that should be decoupled from their user systems. But then again, having physical access to oldhorse would be handy in the event of a security event if I put it on oldhorse. Then again, I’m not so sure I want my preprod environment to be mixed up with production environment resources by hosting production credentials. So, to me, the solution is to use puppet, with oldhorse as the master, to create a local system user on each machine as a fallback in the event of LDAP failure, and create all other non-system users on www_silo in ldap, which will be given a logical name of “ldap_mnemosyne” and an A record of mnemosyne in the surroindustries.com zone file. Service accounts will be created via puppet after the local user is created, for the whole lifecycle of each server, and so will groups. Sudo access will be group-based and require root password. Root ssh will be disabled.

I suppose I could add an ldap slave to ~~www_surro~~ oldhorse (changed my mind later) behind a load balancer and point everything at the load balancer. It’ll be logically named ldap_themis with an A record of themis if I do, with the loadbalancer being an additional ~~A record~~ CNAME record just called ldap.

Software Development Lifecycle Management

Content and Application Pipeline

What about promotion of code from future state pre-prod to future state prod? oldhorse->luxi movement will need to be controlled or I’ll get lazy and start pushing stuff directly to production. I need a way to push data from prod to preprod and code from preprod to prod.

Currently, I am using git, which is frequently suggested to me for making deployments in this use case but is not meant to be used as a deployment tool. It creates flow control maladaptations in the SDLC, like using branches as environment markers; it removes the handling safety provided by a staging area; and it has you using git in prod, which has raised security concerns in many environments when I’ve suggested it in planning meetings in the past (though I don’t particularly agree with those concerns, I can see the argument that git is not meant to be used as a deployment tool). I use git for my version control though and still plan to.

I want ENV->Staging->ENV.

There’s always Jenkins. I hate Jenkins. I abhor Jenkins.
Custom deployment written in python? Remotely acceptable but I need another project eating up my time like I need a speeding ticket.

Ok, so it looks like it’s time to step away and research for code provisioning solutions.

The acceptance criterion is:

a prod environment that says no unless it’s being pushed from pre-prod.

After digging around, I decided that, with all of the bloated deployment suites that want to host your entire SDLC and create a single point of failure in your pipeline, most of which cost money, I’m better off building my own tiny solution using existing components.

I’m going to use an rsync push command over ssh to do this, so when it’s time to promote to prod, I’ll run a script from preprod that passively overwrites the target directory on prod but leaves extra files intact. In the future I’ll probably have it push to a staging directory instead, and then have a command on prod to release the staging area to production directory to make it a little cleaner, but right now that would just be a meaningless extra step.

rsync -a ~/dir1 username@remote_host:destination_directory
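
The one-liner above can be wrapped in a small function so a promotion can be previewed before it touches prod. This is only a sketch: the function name promote, the RSYNC and DRY_RUN variables, and any paths or hosts passed to it are placeholders invented for illustration.

```shell
#!/bin/sh
# Sketch of a promotion helper meant to be run from preprod.
# promote, RSYNC and DRY_RUN are hypothetical names, not part of rsync.
RSYNC="${RSYNC:-rsync}"   # overridable so the command line can be inspected

promote() {
    src="$1"    # local directory on preprod
    dest="$2"   # user@host:directory on prod
    # -a preserves permissions, times and symlinks. There is no --delete, so
    # extra files already on prod are left intact.
    # The trailing slash on the source copies its contents rather than
    # creating the directory itself inside the destination.
    "$RSYNC" -a ${DRY_RUN:+--dry-run} "${src%/}/" "$dest"
}
```

Setting DRY_RUN=1 in the environment adds rsync's --dry-run flag, which goes through the motions without changing anything on the destination.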

Now that we have code promotion worked out, it’s suggested to wipe the preprod area and deploy prod over preprod before every deployment to preprod. I’m earmarking that in case it becomes useful later.

Regarding locking down the rsyncing to avoid creating an attack surface, the suggestion from ananke in freenode/##linux is to do the following:

noshell (I added this, his solution was not dependent on a dedicated user)
command limit in authorized_keys
host control in authorized_keys
ssh key

This translates to something akin to the following in ~/.ssh/authorized_keys:

from="somehost.tld,1.2.3.4",command="/root/bin/rsnapshot_secure_wrapper",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa yourkeyXZXXZXXZX

Which suits this use case perfectly. He also supplied a handy wrapper script:

#!/bin/sh
# small security wrapper for rsnapshot ssh sessions
# the goal here is to reject any commands other than
# rsync in server mode

case "$SSH_ORIGINAL_COMMAND" in
    *\&*)
	echo "rejected"
	;;
    *\(*)
	echo "rejected"
	;;
    *\{*)
	echo "rejected"
	;;
    *\;*)
	echo "rejected"
	;;
    *\<*)
	echo "rejected"
	;;
    *\`*)
	echo "rejected"
	;;
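    # --sender is what arrives when the client pulls from this host (the
    # rsnapshot case); a push arrives as "rsync --server" without --sender
    rsync\ --server\ --sender*)
	$SSH_ORIGINAL_COMMAND
	;;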
    *)
        echo "rejected"
	;;
esac

Which pretty much takes care of everything I needed it to. I use git to update from the current master on preprod, rsync push to prod if I’m happy, and preprod will double as a staging area until I get a more complex codebase established.
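
One caveat worth checking before relying on the wrapper unchanged: the --sender flag is what the remote side sees when a client pulls files from it, which is the rsnapshot case the script was written for. When preprod pushes to prod, the command that arrives on prod should look like "rsync --server ..." without --sender. The sketch below shows how the matching could be flipped for a push-only key. The function name classify_command is made up for illustration, and the exact option string rsync sends varies by version, so treat the patterns as assumptions to verify against your own logs.

```shell
#!/bin/sh
# Sketch: decide whether an incoming SSH command is an rsync push.
# classify_command is a hypothetical helper; a real wrapper would pass it
# "$SSH_ORIGINAL_COMMAND" and only run the command when it prints "accepted".
classify_command() {
    case "$1" in
        # refuse anything containing shell metacharacters
        *\&*|*\;*|*\|*|*\<*|*\>*|*\(*|*\{*|*\`*|*\$*)
            echo "rejected" ;;
        # --sender means the client wants to read files from this host
        "rsync --server --sender "*)
            echo "rejected" ;;
        # rsync in server mode without --sender: the client is pushing
        "rsync --server "*)
            echo "accepted" ;;
        *)
            echo "rejected" ;;
    esac
}
```

Checking for --sender before the general server-mode pattern matters, because the second pattern would otherwise match pulls as well.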

It was also suggested to check out travis or gitlab-ci. I’m particularly fond of gitlab and have used their community edition for repo hosting, so I’ll evaluate them and see how it works out. But I do want to ensure that code in prod can only come from preprod, and this does that.

What about a Data Pipeline?

I hope you’ve been asking this. Just as your code needs to move in a controlled fashion from where you’re testing it to where it’s public-facing, the data used by that code needs to have some control to it as well for this all to work together. If it doesn’t, differences in the database content between Preprod and Prod will give rise to bugs that should have been caught before they made it to prod. The only way for this to be meaningful, in my opinion, is for DATA to move in the opposite direction from your code along the pipeline:

Preprod data is wiped, Prod data moves on top of it.

The only exception to this that I can think of would be environment metadata, or information specific to that environment that is handled by code to know what environment it is running in. Of course, I’m thinking of properties files for JVMs (not in my use case, but this is a great example), or database records that tell the code which environment it’s running in for more complex approaches to deployment management.

As for the databases and storage, I’ll be using rsync, in an identical fashion to our promotion path, for flat file data, and the phpmyadmin database cloning features plus some custom ad hoc scripts to shape the dataset prior to ingestion.
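
For anyone who prefers the command line to phpmyadmin for the database leg, the same prod-to-preprod direction could be sketched with mysqldump over ssh. This is an alternative to the phpmyadmin route, offered only as a sketch: the function name, host, and database name are placeholders, and it assumes credentials are already configured on both ends (for example in ~/.my.cnf).

```shell
#!/bin/sh
# Sketch: overwrite a preprod database with a fresh copy from prod.
# Meant to be run on preprod. refresh_preprod_db is a hypothetical name.
refresh_preprod_db() {
    prod_host="$1"   # ssh destination for the prod box (placeholder)
    db="$2"          # database name, assumed identical on both ends
    # mysqldump emits DROP TABLE / CREATE TABLE statements by default, so
    # tables that exist in prod are replaced wholesale on preprod.
    # --single-transaction takes a consistent snapshot of InnoDB tables.
    ssh "$prod_host" mysqldump --single-transaction "$db" | mysql "$db"
}
```

Tables that exist only on preprod are left alone by this, so drop and recreate the database first if a true wipe is wanted. Any shaping scripts would slot into the pipe between the dump and the import.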

Summary of Future State

Orchestration

oldhorse (preprod) renamed as oldhorse.surroindustries.com:

Fresh install of latest version of Ubuntu Server.
Puppet Master
LDAP service as secondary, A record themis
LDAP client to primary.
Rsync push script to promote preprod code to prod.
apache, virtualhosts as preprod for all sites
database, mysql
platform for all experimental services in testing (redis for example)
control point for future orchestration layer

Operations and Support Services

www_irc_more renamed as cronus.surroindustries.com

Fresh install of latest version of Ubuntu Server.
new gitlab location
puppet slave
ldap client
new znc bouncer
log collection and storage
A record logs.silogroup.org

www_luxi (prod) renamed as prod.luxidolon.com

Fresh install of latest version of Ubuntu Server
puppet slave
ldap client
apache, all virtualhosts not spec’d elsewhere as their prod
database, mysql
pastebin under an A record/virtualhost paste.silogroup.org

www_silo renamed as core.silogroup.org

Fresh install of latest version of Ubuntu Server
puppet slave
ldap client
apache, landing page, static content
various internal APIs that aren’t public
LDAP service as primary, given A record mnemosyne.surroindustries.com

TRANSITION PLANNING

Now that I’ve got a current state and a future state, I need to create a transition plan.

Before the next topic shift: the parent org landing page brings a little more complexity to future state planning in this case.

Since I don’t want to build a system to properly demote a production wordpress dataset to a preprod wordpress database (that’ll be a headache to manage through this type of pipeline), and since this is mostly static content that I propped up with wordpress in a typically foolish “prototype as solution” fashion, we’re looking at a full redesign of the silo landing page to accommodate a sound future architectural state. I am actually fine with this since it’s a landing page, but this would likely not be the route taken for more robust needs. For people actually using the features of wordpress (not in this case), you’ll want an automated means to export the existing wordpress site, adapt the exported payload to fit seamlessly into preprod with a script or parsing engine, and then install it. I am sure there are plugins that will do this, and I have used some similar ones in the past.

In this case I’m just going to build static content out in preprod, leave the existing prod up, and when preprod is finished and tested I’ll be able to promote to a new prod and then edit the DNS zone file to point to the new site. You can only do this for static content — if these were SOA components, like microservices, you’d want to develop iteratively a set of components at a time and that can get pretty complex.

The Plan

In my case it’s easy since I don’t have very much at all to transition, really. I’m building out most of the future state in place of a mostly empty current state, with little content to migrate besides the silo landing page, which will go away eventually. I’ll still need to reinstall the OS underneath it, so the wordpress site has to be exported and re-imported. That is solved by a plugin I’ve used in the past called All-in-One WP Migration, which works flawlessly as far as I’ve been able to tell so far. In fact, it was already used to set up the current state in one of the more proto/primal iterations of the silogroup network.

What about downtime?

In my case I’m totally okay with brief periods of downtime because I’m not providing any services to the public and it’s too early in these projects’ planning to be worried about branding impact.

In all other cases we’d build out the new first, test it thoroughly, in a reproducible way, and then migrate it to the future home for it when it was ready, with a pipeline in place.

So…

oldhorse.surroindustries.com

reinstall everything
set up puppet master
ldap service as a secondary (themis) that syncs to the primary (mnemosyne) on core.silogroup.org
apache, virtualhosts to match prod.luxidolon.com
database, mysql, redis
secret stuff

core.silogroup.org

export and back up silogroup.org wordpress archive
reinstall everything
install puppet
setup wordpress, install plugin
import the site archive
set up LDAP service as primary for everything
configure authentication via ldap service

prod.luxidolon.com

reinstall everything, no backups to make since I’m wiping the current codebase
install puppet
configure ldap
set up apache and all virtualhosts for all products
database, mysql
install a flagship pastebin that paste.silogroup.org will point to

cronus.surroindustries.com

reinstall everything
install puppet
configure ldap
set up redis, mysql
set up gitlab
apache with blank virtualhost
migrate znc logs

And I’m done. And that’s how you screw up transition planning gracefully. But dammit, we’ve got CONTINUITY. Once this is in place I’m expecting to spend so much less time making changes that all the bad, but carefully incremental, planning will have paid off. If I had more servers, a plan like this would get a lot more put into it, but I don’t. I just thought it might be good reading to see what types of thoughts go into these types of transitions. SILO’s subs are probably not spread out or developed enough in concept to warrant the level of effort this type of process would deserve.