The Seven Habits of Highly Effective Scientific Programming

How to painlessly leverage your computer for science

For scientists writing code for science has become somewhat of a necessary evil. They are not programmers, and the most they have in terms of programming at would would be an intensive course on Fortran or Python. They rest they just kind of, for the lack of better terms, learn on the job, often with large code based that have been developed or maintained in a similar fashion. You dread doing anything with this code, and it gets worse in time. If that describes your experience with scientific programming, this article is for you.

Here is the good news: It doesn’t take a PhD in software engineering to keep your code significantly flexible, robust, and yet easy to follow. Based on my long-time experience reading and maintaining existing scientific code, I have come up with the seven easily adaptable habits that might help you render your career as scientific programmer more fulfilling and productive.

We have all be tempted at some point of career as developer. You need a constant pronto, like π. Hey, it’s not gonna change, so let’s just hard code it like this:

function degree_to_radian ( degree )
return ( (3.14/180.0)*degree) )
function radian_to_degree ( radian )
return ( (180.0/3.14)*degree) )
function area_of_circle ( radius )
return ( 3.14*radius*radius )
function log_factorial_stirling ( n )
return ( n*(0.5*log(3.14*n)+log(n)-1.0) )

The 3.14 in the function is known as a magic number. It’s magic because it has no explicit context. The obvious problem is that someone who don’t know that it’s π is will have no clue what is going on. Even if you think to yourself, “ isn’t that obvious,” its variants may not: π/4 (0.7854), 1/π (0.3183), π² (9.8696), log(π) (0.4971). The list goes on, but I think a lot less people would think these obvious.

Another problem: say you want to change 3.14 to 3.14159265. You will have to change every single instance of 3.14 in the code. Isn’t that what a global search and replace is for? But, what if you typed one of them wrong, or when 3.14 is sprinkled across different source files? What you thought was quick thing ended up taking much more time to fix and update.

Instead, you can eliminate this problem by defining a constant for π, and use that in your function instead. Like this:

constant PI = 3.14
constant DEG180 = 180.0
function degree_to_radian ( degree )
return ( (PI/DEG180)*degree )
function radian_to_degree ( radian )
return ( (PI/DEG180)*degree) )
function area_of_circle ( radius )
return ( PI*radius*radius )
function log_factorial_stirling ( n )
return ( n*(0.5*log(PI*n)+log(n)-1.0) )

This way, you can make sure that every time the constant PI shows up, it is 3.14. Or, later on, if you want add more significant digits, you only need changing it in one location.

The code snippet below will print something out every time the variable angle reaches the value of 0.

constant DEG360 = 360
integer angle
# Some code for calculating the value of angleif ( angle == DEG360 )
print ( "I just went full circle" )

But this only works for every degree because angle is an integer. Say we want to add more precision by using a real number type, we could easily modify the code as follows:

constant DEG360 = 360.0
real angle
# Some code for calculating the value of angleif ( angle == DEG360 )
print ( "I just went full circle" )

It’s so straightforward it’s hard to screw this one up, right? Not quite. You see, the precision at which a computer stores real numbers is not infinite. What happens beyond that, no one knows. Because of this, errors accumulate over operations. So by the time the variable angle reaches the if-statement, it could be that it is slightly above or below zero, which means that the computer still doesn’t think it‘s gone full circle when it actually did.

You might have heard of “single” or “double” precision real numbers. This translates correspondingly to six to sixteen significant digits. If you really want something more precise, you could opt for something called “long double”, which could give you up to 33 significant digits, but there is always a limit to its precision.

Of course, it is best to avoid making decisions based on real number comparison. However there might be occasions you cannot get around it. In this case, try to limit your comparison to greater than or less than type with a value comfortably away from some kind of threshold.

constant THRESHOLD = 0.001
constant DEG360 = 360.0
real angle
# Some code for calculating the value of angleif ( (angle-DEG360) < THRESHOLD )
print ( "I just went full circle" )

However, due to precision issues, the value of (angle-DEG360) could be negative and the comparison will fail. To avoid this, amend the if-statement with the absolute value of the difference:

if ( abs(angle-DEG360) < THRESHOLD )

where the difference will be unconditionally positive, and the comparison will always work.

When I started on my current project, I inherited a part of code that contains of a 2000+ line procedure, most of which are repetitions of the same if / else if / else statements. This is outdone by another part of the same code, which consists of 9800 lines of copy-n-pasted decision structures in a single function.

While my hat is off to my predecessors for having such perseverance for producing such quality code, it is obvious that he has not thought about to approach the problem. I rewrote it (the 2000+ line code) and it was about 300 lines, with comments. I will let some other lucky person deal with the almost 10k line function.

I must admit, this takes some experience and a modicum of understanding of computer algorithms. But as scientist you should be able to see patterns and understand how things should be broken down to repeating logical units. As an example, think about your very long day. Let us write some pseudo code for that:

function your_day
get_up
take_shower
leave_house
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_breakfast
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_lunch
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_dinner
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
go_home
get_night_cap
sleep

You can see that the working bit is a bi repetitive. So we can reorganize the code of your day to something like this:

function return_to_work
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
function your_day
get_up
brush_teeth
take_shower
leave_house
return_to_work
eat_breakfast
return_to_work
eat_lunch
return_to_work
eat_dinner
return_to_work
go_home
change_clothes
get_night_cap
sleep

It’s literally half the size of the original pseudo code, and it is much easier to understand. There is a caveat, though. One can go overboard with the level of abstraction, like this example here:

function at_home ( activity, mode )
if ( activity == going_to_work )
go_to_work ( mode )
if ( activity == coming_home )
come_home ( mode )
function go_to_work ( mode )
get_up
if ( mode == normal )
brush_teeth ( morning )
take_shower ( morning )
if ( mode == slow )
read_newspaper
leave_house
function come_home ( mode )
change_clothes
if ( mode == night_cap )
get_night_cap
if ( mode == easy )
watch_tv
sleep
function eat ( meal, type )
if ( meal == breakfast )
eat_breakfast ( type )
if ( meal == lunch )
eat_lunch ( type )
if ( meal == dinner )
eat lunch ( type )
function eat_breakfast ( type )
if ( type == cereal )
take_bowl
pour_milk
add_cereal
if ( type == full )
fry_bacon
fry_eggs
fry_tomatoes
function eat_lunch ( type )
if ( type == salad )
get_salad
if ( type == buffet )
go_to_buffet_restaurant
if ( type == friday )
go_to_biergarten
function eat_dinner ( type )
if ( type == leftover )
micro
if ( type == frozen )
microwave
function return_to_work
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
function your_day
at_home ( going_to_work, slow )
return_to_work
eat ( breakfast )
return_to_work
eat ( lunch )
return_to_work
eat ( dinner )
return_to_work
at_home ( coming_home, night_cap )

Of course, you can reorganize it a bit better so that it is a bit less convoluted, but you get my point. So keep your code modular but don’t overdo it to the point it cannot be easily maintained.

Everybody loves global variables; it’s right there for you and you can do anything with it. On the flip side, what you can change, can also be changed by anyone working on your project. That means you could break each other’s code just by changing these global variables. Here is an abridged version of a situation that I have noticed when I was approached by a colleague:

int file_handlefunction open_file_A ( filename_A )
file_handle = open ( filename_A )
function open_file_B ( filename_B )
file_handle = open ( filename_B )
function do_something_with_file_A
read_from_file_A ( file_handle )

Since the variable file_handle is global and can be changed either function open_file_A and open_file_B, if the function do_something_with_file_A is called after open_file_B is called, you have a problem. The obvious thing to do is not to use file_handle as a global variable:

function open_file_A ( filename_A )
int file_handle
file_handle = open ( filename_A )
return file_handle
function open_file_B ( filename_B )
int file_handle
file_handle = open ( filename_B )
return file_handle
function do_something_with_file_A ( filename_A )
int file_handle
file_handle = open_file_A ( filenname_A )

Those who works on simulation model codes will argue that they really need global access because the structure of these codes are so complex it will take a long time to understand it enough to maintain it. To a certain degree I agree with this. If you find yourself needing global access, you should take every effort to treat them as read-only, and only change them globally if it is absolutely necessary. Fortunately, some codes already have a read-only registry, where everything can be accessed easily, but still safe from accidental user changes.

This is a very common situation: You have a set of point coordinates, and you want to know what is the closest location in a set of Cartesian grid locations. For the sake of argument, let’s consider 2D. A naive approach, as I have seen in some of the codes I have personally seen, look something like this:

point2D point_set[np]
point2D grid[ni][nj]
for k = 1, np
shortest_distance = a_very_large_number
closest_cell_location = (0,0)
for i = 1, ni
for j = 1, nj
this_dist = get_dist ( point_set[k], grid[i][j] )
if ( this_dist < shortest_distance )
shortest_distance = this_dist
closest_cell_location = (i,j)

For each point in the point set, it goes through the entire grid to locate the cloeset cell. The number of look-ups (i.e., distance comparison) is np×ni×nj. When each dimension is 1000, you are talking about 1 billion calculations, which is a lot. The code in question, as a whole, took two weeks to run.

Of course, there is a smarter way of doing the same thing. Since the x and y coordinates in the grid are sorted, even the simple to implement bisection search will improve things significantly:

point2D point_set[np]
print2D grid[ni][nj]
for k = 1, np
shortest_distance = a_verg_large_number
i0 = 1 i1 = ni
j0 = 1 j1 = nj
while ( (i1>i0) and (j1>j0) )
(iNE,jNE) = get_NE_quadrant_grid_center ( i0, i1, j0, j1 )
(iNW,jNW) = get_NW_quadrant_grid_center ( i0, i1, j0, j1 )
(iSE,jSE) = get_SE_quadrant_grid_center ( i0, i1, j0, j1 )
(iSW,jSW) = get_SW_quadrant_grid_center ( i0, i1, j0, j1 )
distNE = get_dist ( point_set[k], grid[iNE][jNE] )
distNW = get_dist ( point_set[k], grid[iNW][jNW] )
distSE = get_dist ( point_set[k], grid[iSE][jSE] )
distSW = get_dist ( point_set[k], grid[iSW][jSW] )
if distNE is minimum
i0 = i1 / 2
j0 = j1 / 2
if distNW is minimum
i1 = i1 / 2
j0 = j1 / 2
if distSE is minimum
i0 = i1 / 2
j1 = j1 / 2
if distSW is minimum
i1 = i1 / 2
j1 = j1 / 2

In this approach, for each point in the point set, four distance calculations are made, and the number of inner loops required is proportional to the logarithm (base 2) of the problem size, instead of its square. So for the same problem size, only a total of 1000 × 4 × 10 = 40000 distance calculations are needed, about 5 orders of magnitude lower than the naive approach. Requiring only about 30 minutes to program and test, this program would complete in less than a minute.

Even when we are living in a time where terabytes of output data is routine, you don’t need to be well versed in sophisticated data structures or expensive hardware to speed up your code. A simple algorithm for quickly narrowing down your region of interest will be enough to give you quick feedback.

You have to add new features to a very large code project. It is literally a maze and you don’t know where to start. It took you week to figure out where you need to edit the code, and another two months to do all the programming. And it took you two months and two weeks to see your first segmentation fault, and you don’t know why. So you spend another month figuring out why before you see your next segmentation fault. After a few rounds of that, you realize the approach doesn’t work, and almost a year has passed.

It is instructive to first develop your feature in a standalone module, test it in isolate until you know it is behaving in the way you expect. Only then you can talk about integrating your feature into the code base. Even though you might need to spend some time in building the standalone infrastructure to mimic the project code base in a minimal way, this approach will still give you a much faster feedback, letting you know whether you are on the right path early on. You can focus on making sure the feature is working first. So when the seg faults in the project code, you know the problem lies on the integration, and not on the correctness of your code behavior.

I have a colleague who wanted me to do things his way, always claiming that it would vectorize better on hardware. Looking through his code, I could quickly conclude that no amount of vectorization would help speed it up beyond a slightly more cleverly designed data structure and algorithm. As a newcomer, the your colleagues might be able to show you the ropes, do know that they probably have the same kind of training in the way of computer development. So do take their opinions with a healthy dose of skepticism, especially when it comes to design and architecture.

The Little Guy talking about Little Things everyone talks about.