Jul 19, 2016 - LXC containers on ZFS


At work, we have a large-scale deployment at AWS on Ubuntu. As a member of the Performance and Operating Systems Engineering team, I am partially responsible for building out and stabilizing the base image we use to deploy our instances. We are currently in the process of migrating to Xenial, the current Ubuntu LTS release. There’s a lot that has to happen to go from the foundation image to our deployable image. There are a few manual steps, such as making our AWS AMI bootable on both PV and HVM instance types (we’ve shared how to do this with Canonical, but they don’t seem too interested, even though it reduces operational complexity by not having to maintain multiple base images). The vast majority of building out our image, on the other hand, is an automated process involving a relatively large and complex chef recipe, which we keep backwards compatible for all versions of Ubuntu we support for our internal customers.

All this works pretty well in practice, but iterating on a new base AMI, like we are doing now for Xenial, takes some time as we try different recipes, update init scripts (systemd is new in Xenial since the last LTS - Trusty), and make various other customizations. Making chef recipes idempotent is difficult and not worth the effort, but that also means it’s not really possible to re-run a recipe after it fails. Trying out a change end-to-end is a fairly long process - we check package source into git, let jenkins build packages, and kick off our automated AMI build process - which involves taking our foundation image, chrooting into it, running the chef recipes, and snapshotting the EBS volume into an AMI. Only then can we launch an EC2 instance on the AMI and see if things worked.

This all takes a fair bit of time when rapidly iterating on our base image, and I wanted to find a quicker way to try potentially breaking changes. Even though we deploy on Ubuntu, all my personal and work laptops, desktops, and servers run base Debian. Lately, I’ve been building out all my filesystems (except for /boot) with ZFS using zfsonlinux (even on my LUKS/dm-crypt encrypted laptops).

I’ve used LXC a fair bit in the past when needing to do cross-distribution builds - and I’ve used BTRFS snapshots to make cloning containers fast and space efficient. ZFS also supports copy-on-write, and is natively supported by LXC on Debian Jessie, so this seemed like a good approach - and it is!

I’ve been using this method to iterate quickly on our recipes. I have a base xenial image that I can clone and start in a few seconds to start from the beginning. I can also snapshot a container at any point in the process so that I can repeat and retry what would otherwise not be idempotent.

Some of the ZFS integration in LXC is not well documented, so here are some rough steps on how I’m doing this on my work desktop, to help anyone else trying to figure this out.

I started with a single ZFS pool called “pool0” with several filesystems:

$ sudo zpool list
pool0   238G   102G   136G         -    17%    42%  1.00x  ONLINE  -

$ sudo zfs list
pool0              110G   120G    96K  none
pool0/home        89.6G   120G  89.6G  /home
pool0/opt         2.61G   120G  2.61G  /opt
pool0/root        6.75G   120G  5.54G  /
pool0/swap        8.50G   129G   186M  -
pool0/tmp          728K   120G   728K  /tmp
pool0/var         2.17G   120G  2.17G  /var

In order to use ZFS volumes, I wanted a new filesystem just for /var/lib/lxc, the default location for LXC containers:

$ sudo zfs create -o mountpoint=/var/lib/lxc pool0/lxc

$ sudo zfs list pool0/lxc
pool0/lxc   539M   120G   124K  /var/lib/lxc

Next, I created my base Xenial LXC container:

$ sudo lxc-create -n xenial -t download -B zfs --zfsroot=pool0/lxc -- --dist ubuntu --release xenial --arch amd64

The “zfsroot” option is important - without it, LXC doesn’t know what pool or filesystem to use (it defaults to ‘tank/lxc’).

At this point, we have a working Xenial container. Before starting it, I manually edited /var/lib/lxc/xenial/rootfs/etc/shadow to remove the passwords for the “root” and “ubuntu” users. I then launched the container, logged in through the console, and changed the passwords for both users. Finally, I installed openssh-server and stopped the container - this is my base that I can now clone.

Cloning a container is easy, and takes just a couple of seconds:

$ sudo lxc-clone -s -o xenial -n try

$ sudo lxc-ls -f
NAME    STATE    IPV4           IPV6                                 AUTOSTART
try     STOPPED  -              -                                    NO
xenial  STOPPED  -              -                                    NO

$ sudo zfs list -r pool0/lxc
pool0/lxc          539M   120G   124K  /var/lib/lxc
pool0/lxc/try      124M   120G   471M  /var/lib/lxc/try/rootfs
pool0/lxc/xenial   415M   120G   415M  /var/lib/lxc/xenial/rootfs

You can see that each container is in its own ZFS copy-on-write volume. I can easily clone and destroy containers now without going through a full build, bake, and deploy process.
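To make the clone, snapshot, and retry loop repeatable, the commands can be wrapped in a small helper. This is only a sketch: the clone command matches the one shown above, but the snapshot and rollback steps use plain "zfs snapshot" and "zfs rollback" on the container's dataset, which is an assumption about how one might do it rather than a description of my exact workflow. The function names and the "pre-chef" label are made up for illustration.

```python
import subprocess

POOL_ROOT = "pool0/lxc"  # ZFS filesystem holding the containers (see listing above)


def clone_cmd(base, name):
    """Command line for a snapshot-backed clone of an existing container."""
    return ["lxc-clone", "-s", "-o", base, "-n", name]


def snapshot_cmd(name, label):
    """Command line to snapshot a container's dataset, e.g. pool0/lxc/try@pre-chef."""
    return ["zfs", "snapshot", f"{POOL_ROOT}/{name}@{label}"]


def rollback_cmd(name, label):
    """Command line to roll a container's dataset back to a snapshot."""
    return ["zfs", "rollback", f"{POOL_ROOT}/{name}@{label}"]


def run(cmd, dry_run=True):
    """Print the command; only execute it (via sudo) when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(["sudo"] + cmd, check=True)


if __name__ == "__main__":
    run(clone_cmd("xenial", "try"))
    run(snapshot_cmd("try", "pre-chef"))  # "pre-chef" is a hypothetical label
    run(rollback_cmd("try", "pre-chef"))  # stop the container before rolling back
```

By default this only prints the commands. Note that "zfs rollback" without extra flags only returns to the most recent snapshot of a dataset.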

Here are a couple more hints. If you have trouble connecting to the LXC console before openssh and networking are enabled, make sure you are connecting to the console tty (for Xenial, I was otherwise getting tty1, which has no getty):

$ sudo lxc-console -n try -t 0

Finally, by default, LXC containers will not be set up with networking. It’s easy to supply an “/etc/lxc/default.conf” to resolve this:

lxc.network.type = veth
lxc.network.link = br0

And remember that the host needs bridged networking to be configured.

May 10, 2016 - Be Careful With Apache mod_headers


Note: This post has been updated since discovering this is NOT an Apache issue, and it turns out to entirely be a problem in the request processing framework of the application Apache is proxying requests to. Some frameworks follow old CGI specs that prohibit hyphens (“-”) in request header names. Apache is passing along both its header and the client-generated headers, but the proxied framework converts “-” to “_” which results in a map/dictionary key collision.

As a result, my “Do this” advice has been updated.

While doing some Apache TLS configuration this week for work, I came across a security edge case with mod_headers and the RequestHeader directive.

A fairly common use-case for this is to pass TLS/SSL headers to a proxied backend service when TLS termination is done in Apache. Imagine a case where client certificates are optional but the backend uses information from the certificate, such as the DN, or just validating if a client certificate was used.

Let’s take that last case as an example to illustrate this security risk, where we wish to pass along the SSL_CLIENT_VERIFY Apache variable to a backend, indicating that a client certificate was successfully used and validated. A common, but insecure configuration (which you’ll find in many guides and blogs if you search) is to do this:

# Don't do this!
RequestHeader set SSL_CLIENT_VERIFY "%{SSL_CLIENT_VERIFY}s"

This directive will add the header “Ssl-Client-Verify” to the request passed to the backend service. However, this header can be overridden and spoofed by a client!

Instead, use the following configuration, which is not vulnerable to header forgery:

# Do this
RequestHeader set SSLCLIENTVERIFY "%{SSL_CLIENT_VERIFY}s"

Some request processing frameworks follow an old CGI specification that prohibits “-” in header names and convert these to “_”. To prevent a client from using a map/dictionary key collision to spoof headers, avoid both of these characters entirely.
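The collision is easy to reproduce outside of any particular framework. The sketch below mimics the CGI convention of upper-casing header names, replacing “-” with “_”, and prefixing them with HTTP_. The names cgi_key and build_environ are hypothetical, and the assumption that the later header wins is mine - real frameworks may resolve duplicates differently.

```python
def cgi_key(header_name):
    """Normalize an HTTP header name the way CGI-style environments do."""
    return "HTTP_" + header_name.upper().replace("-", "_")


def build_environ(headers):
    """Fold (name, value) pairs into a dict keyed by the normalized name.

    Later pairs overwrite earlier ones; which duplicate wins in a real
    framework is an assumption here.
    """
    environ = {}
    for name, value in headers:
        environ[cgi_key(name)] = value
    return environ


if __name__ == "__main__":
    proxied = [
        ("SSL_CLIENT_VERIFY", "(null)"),   # set by Apache, no client cert given
        ("Ssl-Client-Verify", "SPOOFED"),  # sent by the client
    ]
    # Both names land on the same key, so only one value survives.
    print(build_environ(proxied))
    # A name without "-" or "_" has no alternate spelling to collide with.
    print(cgi_key("SSLCLIENTVERIFY"))
```

Two distinct headers on the wire become a single dictionary entry, which is exactly the situation the backend cannot untangle after the fact.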

Here’s an example of header forgery, where we can easily override the Apache generated headers when specified like the “Don’t do this” case above:

$ curl --header "Ssl-Client-Verify: SPOOFED" -i https://my.site.foo

With a valid certificate we can still override the Apache generated header:

$ curl --header "Ssl-Client-Verify: SPOOFED" --cert cert.crt --key cert.key --cacert all.cert \
       -i https://my.site.foo

This is easy to test using a simple Python flask backend service with a route like the following (for easy illustration purposes only, of course):

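from flask import Flask, request
app = Flask(__name__)

@app.route('/')
def root():
    print(request)          # summary of the incoming request
    print(request.headers)  # headers as the backend sees them
    return ''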

The resulting output will show that the client was able to override the Apache header if underscores are used in the RequestHeader directive:

Ssl-Client-Verify: SPOOFED

Whereas using either the second or third form, where dashes are used instead of underscores, the client cannot spoof the header:

Ssl-Client-Verify: SUCCESS

Or if client certificates are optional and none was provided:

Ssl-Client-Verify: (null)

This vulnerability happens if the client passes a header that matches the final header of “Ssl-Client-Verify” (case doesn’t matter, so a spoofed header of “SSL-CLIENT-VERIFY” will result in header forgery). Passing a header of “SSL_CLIENT_VERIFY” from the client will not result in a spoofed header, potentially giving a false sense of security in testing.

The security risk is pretty clear - a misconfigured Apache and a backend request processing framework that munges header names can result in clients spoofing headers, such that a proxied service incorrectly thinks authentication or authorization has been confirmed when it has not.

Be careful, do not use “-” or “_” for header names in RequestHeader!

May 8, 2016 - Jekyll With Isso Comments


I switched this blog over to the Isso commenting system from Disqus, and added support for Isso to my popular Jekyll theme jekyll-clean. It was always a bit of a battle getting Disqus to work right - I had quite a few comments that would not show up, and just logging into Disqus doesn’t work right if you use privacy blockers like I do (Privacy Badger, uBlock Origin, and HTTPS Everywhere for those interested - these are all worthwhile browser extensions to use). There were always some questions about what Disqus does with data, as well.

Isso is self-hosted, which means you can’t directly use it on static webhosting such as github pages, and while your data is arguably no more safe on someone’s random self-hosted blog (such as this one!), Isso allows anonymous comments - so people only have to provide as much detail as they wish. For people who want to demand it, you can make the email and name fields mandatory, but there’s no verification so in practice there’s not much point (when I come across comment forms that require an email I always give a fake one).

We’ll see if spam is an issue - Isso has a basic moderation system. That’s one benefit of hosted solutions such as Disqus - they have shared knowledge about spammers and can make some reasonable attempts to control it, along with requiring you to create an account (with the obvious downside being the lack of anonymous comments I mention above).

So, in the end, it’s not a clear choice, and everyone has to choose what matters most to them - there are a few options other than Isso as well, but I liked the fact that Isso is small and simple, written in Python, and uses sqlite for storage. There’s not much to go wrong nor much attack surface for abuse.

Integrating Isso with Jekyll is pretty easy, you can take a look at jekyll-clean to see how I approached it.

On the topic of Jekyll for blogs - I switched over to Jekyll for this blog about a year and a half ago and don’t regret it for a moment. It’s simple, easy to modify and theme, and super super fast.

Jan 4, 2016 - Swapping display outputs with i3


In the (unofficial) #i3 IRC channel on freenode, someone recently asked if i3 has a command to switch between display outputs. Somewhat surprisingly, i3-msg doesn’t have such a command, but I realized that one could be written pretty easily by getting the list of workspaces, finding which one is visible but not focused, and then switching to that workspace. Here’s the command I came up with, shown as a full binding from ~/.i3/config:

bindsym $mod+Next exec i3-msg workspace $(i3-msg -t get_workspaces | jq '.[] | select(.visible == true) | select(.focused == false) | .name')

You’ll need to have ‘jq’ installed.

The logic for this is pretty simple:

  • Get a list of all workspaces
  • Iterate through the list
  • Find workspaces that are visible
  • Find workspaces that are not focused
  • Print the workspace names

With two displays, you can only have two visible workspaces, and only one of those will be focused. I simply find a visible workspace that is not focused.
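For anyone who would rather not depend on jq, the same filter is a one-liner in Python. This is a sketch run against a hand-written sample shaped like the get_workspaces reply (only the fields used here are included, and the output names are invented). The function name other_visible_workspaces is my own.

```python
import json


def other_visible_workspaces(workspaces):
    """Return names of workspaces that are visible but not focused."""
    return [ws["name"] for ws in workspaces if ws["visible"] and not ws["focused"]]


if __name__ == "__main__":
    # Hand-written sample, not captured output; output names are made up.
    sample = json.loads("""
    [
      {"name": "1", "visible": true,  "focused": true,  "output": "DP-1"},
      {"name": "2", "visible": false, "focused": false, "output": "DP-1"},
      {"name": "3", "visible": true,  "focused": false, "output": "HDMI-1"}
    ]
    """)
    print(other_visible_workspaces(sample))
```

With two displays the returned list has exactly one entry, which is the workspace to switch to.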

This won’t work as-is if you have more than 2 monitors, but you could easily have a different binding for each display, filtering for a specific “output” in place of the filter that looks for the non-focused workspace.

With a little more work, you could rotate through displays by parsing things out with python or awk (or whatever) and enumerating all the displays to determine which should be “next” or “prior”.
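Here is a rough sketch of what that rotation could look like in Python. It assumes the workspace list has already been parsed from JSON, and it orders displays by sorting on the output name, which is an arbitrary choice of mine - you may prefer physical left-to-right order instead. The function name next_workspace and the output names in the example are hypothetical.

```python
def next_workspace(workspaces, step=1):
    """Return the visible workspace on the next (or prior) output.

    Outputs are ordered by name; step=1 moves forward, step=-1 backward.
    """
    visible = sorted(
        (ws for ws in workspaces if ws["visible"]),
        key=lambda ws: ws["output"],
    )
    # Exactly one visible workspace is expected to be focused.
    current = next(i for i, ws in enumerate(visible) if ws["focused"])
    return visible[(current + step) % len(visible)]["name"]


if __name__ == "__main__":
    sample = [
        {"name": "1", "visible": True, "focused": True, "output": "DP-1"},
        {"name": "3", "visible": True, "focused": False, "output": "HDMI-1"},
        {"name": "5", "visible": True, "focused": False, "output": "VGA-1"},
    ]
    print(next_workspace(sample))      # workspace on the next output
    print(next_workspace(sample, -1))  # workspace on the prior output
```

The modulo makes the rotation wrap around, so stepping forward from the last display lands back on the first.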

But, if you only have two displays, like I do, the above binding will let you easily jump back and forth between outputs.

Nov 3, 2015 - autofs with NTFS


I ran into an interesting issue recently when I wanted to set up autofs with an NTFS filesystem on an external USB drive (this drive, unfortunately, could not be btrfs like everything else).

The version of autofs shipped with Debian Jessie (and many other varieties of Linux, including Ubuntu) has a bug where it passes an invalid mount option (-s) to ntfs-3g, and as a result NTFS filesystems can’t be mounted. This bug has been fixed in a newer version of autofs, but instead of building it from source (it’s not available in jessie-backports), I just did a silly hack instead.

The relevant portion of my autofs map looks like:

somedrive      -fstype=ntfs,rw :/dev/disk/by-uuid/EA59283A9207245F1

Without enabling debugging, you don’t get much useful output:

$ ls /media/somedrive
ls: cannot access /media/somedrive: No such file or directory

However, after enabling automount debugging (I just hacked up the init script), the problem becomes clear:

do_mount: /dev/disk/by-uuid/EA59283A9207245F1 /media/somedrive type ntfs options rw using module generic
mount_mount: mount(generic): calling mkdir_path /media/somedrive
mount_mount: mount(generic): calling mount -t ntfs -s -o rw /dev/disk/by-uuid/EA59283A9207245F1 /media/somedrive
spawn_mount: mtab link detected, passing -n to mount
>> ntfs-3g: Unknown option '-s'.
>> ntfs-3g 2014.2.15AR.2 integrated FUSE 28 - Third Generation NTFS Driver
>> #011#011Configuration type 7, XATTRS are on, POSIX ACLS are on
>> Copyright (C) 2005-2007 Yura Pakhuchiy
>> Copyright (C) 2006-2009 Szabolcs Szakacsits
>> Copyright (C) 2007-2014 Jean-Pierre Andre
>> Copyright (C) 2009 Erik Larsson
>> Usage:    ntfs-3g [-o option[,...]] <device|image_file> <mount_point>
>> Options:  ro (read-only mount), windows_names, uid=, gid=,
>>           umask=, fmask=, dmask=, streams_interface=.
>>           Please see the details in the manual (type: man ntfs-3g).
>> Example: ntfs-3g /dev/sda1 /mnt/windows
>> News, support and information:  http://tuxera.com
mount(generic): failed to mount /dev/disk/by-uuid/EA59283A9207245F1 (type ntfs) on /media/somedrive

My hack to fix this is pretty simple: I created a wrapper around the real ntfs-3g binary that strips the -s option out of the command line. Very brute force.

I dropped my script into /bin/ntfs-3g_wrapper, renamed /bin/ntfs-3g to /bin/ntfs-3g_real, and made a symbolic link from /bin/ntfs-3g to the wrapper.

Here’s the script, placed in /bin/ntfs-3g_wrapper:


#!/bin/sh
# Drop the "-s" option that autofs passes, then hand off to the real binary.
OPTS=$(echo $@ | sed "s/ -s / /")

exec /bin/ntfs-3g_real $OPTS
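The echo-and-sed approach flattens the arguments into a single string, which would mangle anything containing spaces. A slightly more careful variant filters the argument list itself. The sketch below shows just that filtering step in Python, not what I actually installed. The name strip_flag is made up, and unlike the sed expression it removes every standalone -s rather than only the first.

```python
import sys


def strip_flag(args, flag="-s"):
    """Return args without any standalone occurrence of flag."""
    return [arg for arg in args if arg != flag]


if __name__ == "__main__":
    # For example: /dev/sdb1 /media/somedrive -s -o rw
    print(strip_flag(sys.argv[1:]))
```

A full wrapper would pass the filtered list on to /bin/ntfs-3g_real, for instance with os.execv, instead of printing it.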

And now let’s move things around and create the symlink:

$ sudo mv /bin/ntfs-3g /bin/ntfs-3g_real
$ sudo ln -s /bin/ntfs-3g_wrapper /bin/ntfs-3g

BE CAREFUL IF YOU DECIDE TO DO THIS - I take no responsibility if you break your system. And remember: any update to the ntfs-3g package will require at a minimum moving the binary again and recreating the symlink, or maybe even more work depending…

This is, in general, the wrong way to solve these sorts of problems - you have to remember what you did for when it breaks again, and you never know what other issues it may cause.

In my case this was a quick and dirty solution, and if a newer version of autofs ends up in jessie-backports, I can easily undo it. Hopefully this helps others, as this is apparently a pretty common problem with autofs.