Jan 27, 2015 - Atom editor on Wheezy

If you have systems running Debian Wheezy and want to check out the Atom editor, here’s a great solution that doesn’t require compromising your perfectly working system by installing cross-release libraries from testing or unstable. It takes a bit of work to set up, but it results in a private copy of libc just for Atom. The same technique can be used for any binary that requires a newer version of libc.

How do you know you need to do this? Because you’ll get the following when starting Atom:

/usr/share/atom/atom: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /usr/share/atom/atom)
/usr/share/atom/atom: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /usr/share/atom/libchromiumcontent.so)
/usr/share/atom/atom: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /usr/share/atom/libgcrypt.so.11)
/usr/share/atom/atom: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.15' not found (required by /usr/share/atom/libgcrypt.so.11)

Update: Atom works out of the box on Debian Jessie - so no need to do anything special there. Since Jessie is now the current version of Debian, it’s time to upgrade if you are still running into this.

Jan 26, 2015 - CLOSE_WAIT socket leak in java.net.HttpURLConnection

I came across an interesting socket leak a few weeks ago which was maddeningly difficult to track down and reproduce so it could be fixed.

This all started when I noticed a large number of CLOSE_WAIT sockets to the AWS metadata service (169.254.169.254) on every AWS instance - see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html :

tcp6       1      0 10.0.0.1:43396      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43438      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43045      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43317      169.254.169.254:80      CLOSE_WAIT
[...]
tcp6       1      0 10.0.0.1:43388      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43279      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43357      169.254.169.254:80      CLOSE_WAIT
tcp6       1      0 10.0.0.1:43134      169.254.169.254:80      CLOSE_WAIT

The number of these would increase by about two a minute until they eventually got cleaned up - at most, there’d be around 120-130 of them.

CLOSE_WAIT means the remote end has closed the connection, but the local application (Java in this case) hasn’t yet acknowledged the closure and cleaned things up on its side. So it clearly seemed to be a bug in our code, but I wasn’t sure where. I used tcpdump (“sudo tcpdump -A host 169.254.169.254”) to see which requests were being made and made an educated guess at which service might be responsible, eventually tracking it down. The code was a couple of years old and hadn’t changed except that some new metadata requests had been added for VPC support - yet the problem didn’t occur with the older version of the client library.

It seemed strange that new requests would cause this, but after some more tcpdumps I discovered that one of the new requests always resulted in an HTTP 404 from the metadata service, and those requests always resulted in a CLOSE_WAIT socket (matching up the ports in tcpdump with those from netstat). That request is only useful in a VPC, but running it in EC2 Classic isn’t really a problem - it just means there’s no metadata for that case.

It seemed pretty obvious that somehow the code must not be closing the socket after a 404, but as another engineer and I looked at the code, it didn’t actually do anything different whether it got an HTTP 200 or a 404, aside from parsing the response or not. We tried making sure all data was read from the InputStream, closing the InputStream more aggressively, calling disconnect() explicitly, and a few other things, but nothing helped. The code uses Java’s HttpURLConnection: http://docs.oracle.com/javase/7/docs/api/java/net/HttpURLConnection.html
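
For reference, here’s a minimal sketch of the kind of request code we were staring at. This is my own simplified reconstruction, not the actual client library code - the class and method names are made up:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class MetadataClient {
    // Simplified reconstruction of the metadata fetch. On a 200 the body is
    // read from getInputStream() and closed; on anything else we just return
    // null, and nothing ever reads or closes the error stream that actually
    // holds the 404 body - which is what left those sockets in CLOSE_WAIT.
    static String fetch(String path) throws IOException {
        URL url = new URL("http://169.254.169.254" + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
            return null; // treat as "no metadata for this key"
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            in.close();
        }
    }
}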

After trying all these different things, the other engineer noticed a method on HttpURLConnection called getErrorStream() - documented as “Returns the error stream if the connection failed but the server sent useful data nonetheless”.

It turns out that whoever wrote HttpURLConnection (who I refer to as “bonehead”) decided to implement it sort of like stdout/stderr - if a request returns a successful HTTP status code, the response body comes from one stream, but if it returns an error, the body comes from a different stream. This doesn’t really help with coding, because either way you still have to check the HTTP response code to see whether the request failed, but with this implementation you also need a whole separate code block to process the ErrorStream if there was an error.

The docs for HttpURLConnection only say the following - you’ll note they don’t say anything about closing the ErrorStream: “Calling the close() methods on the InputStream or OutputStream of an HttpURLConnection after a request may free network resources associated with this instance but has no effect on any shared persistent connection. Calling the disconnect() method may close the underlying socket if a persistent connection is otherwise idle at that time”.

The fix was simple: a block of code where, if the HTTP response code is not a 200, instead of just returning failure, we get the ErrorStream, read all the data off of it (discarding it), and then close it.
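
In HttpURLConnection terms, the cleanup looks roughly like the sketch below. This is an illustration rather than the actual client code - drainErrorStream() and the class around it are names I’ve made up for this post:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

public final class HttpCleanup {
    // Drain and close the error stream after a non-200 response so the
    // connection's underlying socket can be released (or reused) instead of
    // lingering in CLOSE_WAIT.
    static void drainErrorStream(HttpURLConnection conn) {
        InputStream err = conn.getErrorStream();
        if (err == null) {
            return; // the server sent no error body
        }
        try {
            byte[] buf = new byte[4096];
            while (err.read(buf) != -1) {
                // discard the error body
            }
        } catch (IOException ignored) {
            // best effort - all we really need is the close() below
        } finally {
            try {
                err.close();
            } catch (IOException ignored) {
                // nothing more to do
            }
        }
    }
}

In the fetch sketch above, this would be called right after getResponseCode() returns something other than 200, just before returning failure.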

Nov 1, 2014 - A Free 1GB RAM Upgrade

Last spring I built a cheap little Linux box for a server, consolidating NAS storage and a few other functions (fetchmail receiver, IMAP server, IRC bouncer, btsync master, Plex server, network backups, etc.). Although I wasn’t on any particular budget, I was going for cheap but decent, with noise and power consumption in mind, and ended up paying around $250 for a case, mobo, CPU, and RAM. I already had a 1TB SATA drive from a Buffalo Cloudstor NAS (what a slow piece of garbage - except for the drive).

I went with an AMD A6-6400K (dual core, 3.9GHz). It has an onboard Radeon GPU, which I don’t care about as this is a headless server. What I didn’t know at the time is that this integrated GPU uses system RAM, and the MSI mainboard I got, an FM2-A75IA-E53, is somewhat limited in what options it exposes for managing this.

So I built the system, and it ran great, but I noticed only 3GB of RAM was available to Linux, even though the BIOS showed 4GB and lshw showed a 4GB module installed. Suspecting the BIOS was allocating RAM for the GPU, I double-checked the settings, but there was simply no way to choose the amount of RAM consumed by the GPU.

I found a BIOS update, but then discovered there was no way to install it - you cannot update the BIOS from within the BIOS itself; you need Windows installed, since the updates are distributed as a .exe file (what the heck, who came up with this idea?). Useless to me!

At this point I gave up and just ignored the issue - 3GB really is enough RAM for this server, and I lost interest in trying to solve it. I could have returned the motherboard, but decided that dealing with Fry’s return department wasn’t worth the hassle. So I forgot all about it for months.

A couple of weeks ago I remembered this whole thing and decided to try to figure it out again. I started with the BIOS update, found that it could be done by creating an MS-DOS bootable USB stick, and found instructions for doing that. And it worked! Great - now that I’m on the latest BIOS, I bet they’ve fixed this.

My hopes were quickly dashed when I discovered there were still no options in the MSI BIOS to control RAM for the GPU. I started to think that maybe it wasn’t this simple - maybe something else was at fault. Maybe the BIOS wasn’t exposing things correctly to Linux, or maybe some esoteric IOMMU driver was needed, or the kernel needed to coordinate with the GPU, so I installed the radeon driver hoping that Linux could ask for the RAM back once it was aware of the GPU. All of this, whether crazy or not, led nowhere. And yes, of course I am running a 64-bit kernel.

I did a ton of Google searches and research on this particular problem, first after building the system and then again more recently as I worked through it. I found vague references to others with similar issues, but nobody had a solution that jibed with the problem I was having. Searches of MSI’s own support site also turned up nothing, and I was about to contact their tech support (perhaps I should have done this from the beginning) when I thought I’d try one last thing - was there a way to disable the GPU? I didn’t really want to go down this road: while the server is headless, I do occasionally plug it into a monitor. I was also a bit concerned that if I disabled the GPU, the system wouldn’t POST at all and I’d have to clear the CMOS and figure out how to get things back. That all seemed like more effort than just ignoring the missing 1GB of RAM.

So I duckduckgo’d how to disable the GPU in the BIOS, wading through article after article and several posts on MSI’s tech support site, when I came across this post. Oddly, I had never found it in any of my searches about how to reduce the GPU memory, but it described a very sneaky trick: to reduce GPU RAM, you have to select “Dual Graphics” - then (and only then) will the MSI BIOS expose the option to change the RAM allocation.

It is incredibly frustrating that MSI buried this option inside another option that no sane person would think to try. It seems so simple now that I know the answer, but just like a video game cheat code, you have to know the exact right sequence of options to expose this one.

So now, I was able to reduce the 1GB allocation to a minimum of 32MB. That’s still a small chunk of wasted RAM, but 32MB doesn’t make me lose any sleep.

So I rebooted, and… YES! Linux now has 4(ish)GB RAM available.

Oct 23, 2014 - VW Electric Golf (eGolf) Spotted

Yesterday I was behind a Volkswagen Electric Golf (VW eGolf) with manufacturer plates (Michigan) - the vehicle goes on sale in November, and it was clearly a production model with no attempt at stealth.

Here are a couple of pictures, Enjoy!

eGolf photo 1

eGolf photo 2

I’ve read that VW decided to air-cool the batteries, omitting the liquid-cooled thermal management system (TMS) that any proper EV with lithium-ion batteries needs. Nissan found this out the hard way with the Leaf - and the resulting class-action lawsuit - when their batteries started to die after just a few years. Amazingly, Nissan never fixed this engineering error, and it looks like VW is going down the same road.

For comparison, the Tesla, the Volt, and the Spark EV (the latter of which I have) all use a liquid-cooled TMS to ensure the batteries stay at a proper operating temperature.

Sep 30, 2014 - Nvidia driver on backported kernel in Wheezy

Update 2015-01-05: This is no longer necessary. When I wrote this, the current version of nvidia-kernel-dkms was 319.82. The current version in the Wheezy Debian repository (as I write this update) is 340.65, which resolves this bug, at least on the current backported kernel (3.16.0-0.bpo-4). If you are still encountering this issue, please make sure you are on both the latest kernel and the latest nvidia driver.

If you like to use a backported kernel in Debian Wheezy, like I do, and have systems with Nvidia graphics cards, you’ve likely discovered an incompatibility between recent kernels (since 3.13) and the Nvidia driver.

Because of this problem, I’ve been using the Nouveau driver since 3.13, but today I decided to give nvidia-kernel-dkms another try after getting upgraded to the 3.16 kernel.

I was disappointed to find the same incompatibility still exists. Specifically, the problem is that acpi_os_wait_events_complete was removed from the Linux kernel, but the Nvidia driver still references it. You’ll know you’ve hit this problem when X doesn’t start and dmesg or /var/log/syslog contains:

kernel: [   27.487723] nvidia: Unknown symbol acpi_os_wait_events_complete (err 0)

Instead of just doing my usual thing of switching back to the Nouveau driver, I was curious how hard it would be to patch the driver. I know the patch for this exists, and part of the whole point of DKMS is to make it easy to compile drivers that are not part of the core kernel.

As it turns out, this problem is really, really easy to fix and you should be running with the Nvidia driver in just a few minutes if you follow these instructions.

Instructions

  • This assumes you are already running on a backported kernel. I’m using 3.16, but this should work for 3.13 and 3.14 kernels as well (there was no backport of 3.15 in Wheezy, as far as I am aware).
  • First, follow the instructions here on the Debian wiki under the Wheezy backports section. Make sure you follow all of those 6 steps exactly.
  • After rebooting, you’ll have no X display and will see the Unknown symbol error message shown above.
  • Switch to a VT (Alt-F1) and login to your system.
  • As root, edit the file /usr/src/nvidia-current-319.82/nv-acpi.c (the 319.82 part may change if the backport is updated, and of course the contents of the file might change as well - you may have to find the section in question). Comment out lines 304-310 so the file looks like:
/*
    if (pNvAcpiObject->notify_handler_installed)
    {
        NV_ACPI_OS_WAIT_EVENTS_COMPLETE();

        // remove event notifier
        status = acpi_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY, nv_acpi_event);
    }
*/
  • Now run the following command to rebuild the Nvidia DKMS module:
$ sudo dpkg-reconfigure nvidia-kernel-dkms
  • And finally, restart your display manager (I use lightdm, so yours might be different):
$ sudo /etc/init.d/lightdm stop
$ sudo /etc/init.d/lightdm start
  • And that’s it! There’s no need to reboot - X will load the Nvidia driver automatically as long as you set up /etc/X11/xorg.conf.d/20-nvidia.conf per the Debian wiki’s instructions.

Of course, if you get a system update that changes the nv-acpi.c file, then you’ll have to reapply the patch. As long as that file isn’t changed, things will otherwise be fine, even if your kernel gets upgraded, since a new DKMS module will be built from the modified source.