Tuesday, May 18, 2010

Linux issues

Linux is often mentioned as a reliable server platform. In the last few days I got two error messages from users that really caused confusion:

The first case is a large GAMS job that terminates with:

*** Error: Could not spawn gamscmex, rc = 1
gams: Could not spawn gamscmex, rc = 1

After asking more questions and running a small GAMS test model that allocates > 2 GB, we concluded that the server is running out of memory and the OS starts killing some large processes, gamscmex being one of them. The message is of course not very informative. Maybe perror/strerror would have given a better message.
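To illustrate what a more informative message could look like, here is a minimal sketch in Python (not how GAMS itself launches gamscmex, which I do not know). It assumes a POSIX system, where a child killed by a signal shows up as a negative return code. The names describe_exit and run_and_explain are made up for this example, and the OOM hint is only a guess the program can offer, since SIGKILL can come from other sources too.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable message."""
    if returncode == 0:
        return "child exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        hint = ""
        if signum == signal.SIGKILL:
            # Assumption: SIGKILL without an obvious sender is often the
            # kernel OOM killer, but the kernel log has the final word.
            hint = " (possibly the kernel OOM killer; check dmesg)"
        return f"child was killed by {name}{hint}"
    return f"child exited with status {returncode}"


def run_and_explain(cmd):
    """Run cmd (a list of strings) and explain how it ended."""
    try:
        result = subprocess.run(cmd)
    except OSError as exc:
        # The child could not be started at all; exc.strerror carries the
        # same text that strerror(errno) would give in C.
        return f"could not spawn {cmd[0]}: {exc.strerror}"
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Hypothetical path, used only to trigger the "could not spawn" branch.
    print(run_and_explain(["/no/such/dir/gamscmex"]))
```

With something like this, a job killed by the kernel would be reported as killed by SIGKILL rather than as a bare "rc = 1", and a missing executable would be reported with the operating system's own error text.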

Apparently Linux needs to get rid of processes now and then. Without having thought this through, I would expect it to be better not to kill a process but to refuse to hand out new memory to user processes (i.e. let malloc fail as soon as memory is exhausted, instead of giving out more memory than there is). Well, anyway. It seems that the preferred candidates are processes that started recently and use a lot of memory (http://linux-mm.org/OOM_Killer). Here is more info about why: http://lwn.net/Articles/317814/. Linux seems to allow over-committing memory by default. One of the reasons is the fork system call: it can cause memory requests that are much larger than what is actually needed (think about a large process spawning a small process), hence the decision to allow over-commitment. (Here is a funny analogy: http://lwn.net/Articles/104185/).
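To see which over-commit policy a given server uses, one can read /proc/sys/vm/overcommit_memory. The sketch below does that in Python. The function names are made up for this example, and the short descriptions are my own paraphrase of the three documented kernel modes (0, 1 and 2).

```python
from pathlib import Path

# Paraphrased meanings of the vm.overcommit_memory settings.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting (no overcommit beyond the commit limit)",
}


def describe_overcommit(raw):
    """Map the text found in overcommit_memory to a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit_policy(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of the policy, or None if it cannot be read."""
    try:
        return describe_overcommit(Path(path).read_text())
    except (OSError, ValueError):
        # Not Linux, no /proc, or unexpected file contents.
        return None


if __name__ == "__main__":
    print(current_overcommit_policy())
```

In mode 2 a large allocation fails up front instead of succeeding and triggering the OOM killer later, which is closer to the behavior I would have expected.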

Here is another one. When gdxls encounters an error (in this case the properties file was missing or incorrectly named) it prints a Java stack trace. Well, in some cases a user saw:

GDXLS V 0.3, Amsterdam Optimization (c) 2008-2010

spcolprop.txt (No such file or directory)

Exception in thread "main" java.lang.NullPointerException
*** Got java.lang.NullPointerException while trying to print stack trace.

I suspect a problem in the Java runtime here, since it fails while trying to print the stack trace.

Good error messages are very important and can really reduce support costs. Error messages are issued when the user is possibly doing something wrong. Confusing an already confused user with a confusing error message only increases the confusion. The messages mentioned in this post are good examples of the indirect cost of lousy messages: users email the problem, even I don’t understand what is happening, and a lot of back-and-forth is needed before we can (approximately) conclude what is going on.