The Seven Habits of Highly Effective Scientific Programming

How to painlessly leverage your computer for science

For scientists writing code for science has become somewhat of a necessary evil. They are not programmers, and the most they have in terms of programming at would would be an intensive course on Fortran or Python. They rest they just kind of, for the lack of better terms, learn on the job, often with large code based that have been developed or maintained in a similar fashion. You dread doing anything with this code, and it gets worse in time. If that describes your experience with scientific programming, this article is for you.

Here is the good news: It doesn’t take a PhD in software engineering to keep your code significantly flexible, robust, and yet easy to follow. Based on my long-time experience reading and maintaining existing scientific code, I have come up with the seven easily adaptable habits that might help you render your career as scientific programmer more fulfilling and productive.

1. No magic

function degree_to_radian ( degree )
return ( (3.14/180.0)*degree) )
function radian_to_degree ( radian )
return ( (180.0/3.14)*degree) )
function area_of_circle ( radius )
return ( 3.14*radius*radius )
function log_factorial_stirling ( n )
return ( n*(0.5*log(3.14*n)+log(n)-1.0) )

The 3.14 in the function is known as a magic number. It’s magic because it has no explicit context. The obvious problem is that someone who don’t know that it’s π is will have no clue what is going on. Even if you think to yourself, “ isn’t that obvious,” its variants may not: π/4 (0.7854), 1/π (0.3183), π² (9.8696), log(π) (0.4971). The list goes on, but I think a lot less people would think these obvious.

Another problem: say you want to change 3.14 to 3.14159265. You will have to change every single instance of 3.14 in the code. Isn’t that what a global search and replace is for? But, what if you typed one of them wrong, or when 3.14 is sprinkled across different source files? What you thought was quick thing ended up taking much more time to fix and update.

Instead, you can eliminate this problem by defining a constant for π, and use that in your function instead. Like this:

constant PI = 3.14
constant DEG180 = 180.0
function degree_to_radian ( degree )
return ( (PI/DEG180)*degree )
function radian_to_degree ( radian )
return ( (PI/DEG180)*degree) )
function area_of_circle ( radius )
return ( PI*radius*radius )
function log_factorial_stirling ( n )
return ( n*(0.5*log(PI*n)+log(n)-1.0) )

This way, you can make sure that every time the constant PI shows up, it is 3.14. Or, later on, if you want add more significant digits, you only need changing it in one location.

2. Avoid real number comparisons (within reason)

constant DEG360 = 360
integer angle
# Some code for calculating the value of angleif ( angle == DEG360 )
print ( "I just went full circle" )

But this only works for every degree because angle is an integer. Say we want to add more precision by using a real number type, we could easily modify the code as follows:

constant DEG360 = 360.0
real angle
# Some code for calculating the value of angleif ( angle == DEG360 )
print ( "I just went full circle" )

It’s so straightforward it’s hard to screw this one up, right? Not quite. You see, the precision at which a computer stores real numbers is not infinite. What happens beyond that, no one knows. Because of this, errors accumulate over operations. So by the time the variable angle reaches the if-statement, it could be that it is slightly above or below zero, which means that the computer still doesn’t think it‘s gone full circle when it actually did.

You might have heard of “single” or “double” precision real numbers. This translates correspondingly to six to sixteen significant digits. If you really want something more precise, you could opt for something called “long double”, which could give you up to 33 significant digits, but there is always a limit to its precision.

Of course, it is best to avoid making decisions based on real number comparison. However there might be occasions you cannot get around it. In this case, try to limit your comparison to greater than or less than type with a value comfortably away from some kind of threshold.

constant THRESHOLD = 0.001
constant DEG360 = 360.0
real angle
# Some code for calculating the value of angleif ( (angle-DEG360) < THRESHOLD )
print ( "I just went full circle" )

However, due to precision issues, the value of (angle-DEG360) could be negative and the comparison will fail. To avoid this, amend the if-statement with the absolute value of the difference:

if ( abs(angle-DEG360) < THRESHOLD )

where the difference will be unconditionally positive, and the comparison will always work.

3. Break down the code into simple, repeatable units

While my hat is off to my predecessors for having such perseverance for producing such quality code, it is obvious that he has not thought about to approach the problem. I rewrote it (the 2000+ line code) and it was about 300 lines, with comments. I will let some other lucky person deal with the almost 10k line function.

I must admit, this takes some experience and a modicum of understanding of computer algorithms. But as scientist you should be able to see patterns and understand how things should be broken down to repeating logical units. As an example, think about your very long day. Let us write some pseudo code for that:

function your_day
get_up
take_shower
leave_house
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_breakfast
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_lunch
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
eat_dinner
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
go_home
get_night_cap
sleep

You can see that the working bit is a bi repetitive. So we can reorganize the code of your day to something like this:

function return_to_work
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
function your_day
get_up
brush_teeth
take_shower
leave_house
return_to_work
eat_breakfast
return_to_work
eat_lunch
return_to_work
eat_dinner
return_to_work
go_home
change_clothes
get_night_cap
sleep

It’s literally half the size of the original pseudo code, and it is much easier to understand. There is a caveat, though. One can go overboard with the level of abstraction, like this example here:

function at_home ( activity, mode )
if ( activity == going_to_work )
go_to_work ( mode )
if ( activity == coming_home )
come_home ( mode )
function go_to_work ( mode )
get_up
if ( mode == normal )
brush_teeth ( morning )
take_shower ( morning )
if ( mode == slow )
read_newspaper
leave_house
function come_home ( mode )
change_clothes
if ( mode == night_cap )
get_night_cap
if ( mode == easy )
watch_tv
sleep
function eat ( meal, type )
if ( meal == breakfast )
eat_breakfast ( type )
if ( meal == lunch )
eat_lunch ( type )
if ( meal == dinner )
eat lunch ( type )
function eat_breakfast ( type )
if ( type == cereal )
take_bowl
pour_milk
add_cereal
if ( type == full )
fry_bacon
fry_eggs
fry_tomatoes
function eat_lunch ( type )
if ( type == salad )
get_salad
if ( type == buffet )
go_to_buffet_restaurant
if ( type == friday )
go_to_biergarten
function eat_dinner ( type )
if ( type == leftover )
micro
if ( type == frozen )
microwave
function return_to_work
return_to_office
log_in
write_code
go_to_meeting
get_on_the_phone
log_out
leave_office
function your_day
at_home ( going_to_work, slow )
return_to_work
eat ( breakfast )
return_to_work
eat ( lunch )
return_to_work
eat ( dinner )
return_to_work
at_home ( coming_home, night_cap )

Of course, you can reorganize it a bit better so that it is a bit less convoluted, but you get my point. So keep your code modular but don’t overdo it to the point it cannot be easily maintained.

4. Be weary of global variables

int file_handlefunction open_file_A ( filename_A )
file_handle = open ( filename_A )
function open_file_B ( filename_B )
file_handle = open ( filename_B )
function do_something_with_file_A
read_from_file_A ( file_handle )

Since the variable file_handle is global and can be changed either function open_file_A and open_file_B, if the function do_something_with_file_A is called after open_file_B is called, you have a problem. The obvious thing to do is not to use file_handle as a global variable:

function open_file_A ( filename_A )
int file_handle
file_handle = open ( filename_A )
return file_handle
function open_file_B ( filename_B )
int file_handle
file_handle = open ( filename_B )
return file_handle
function do_something_with_file_A ( filename_A )
int file_handle
file_handle = open_file_A ( filenname_A )

Those who works on simulation model codes will argue that they really need global access because the structure of these codes are so complex it will take a long time to understand it enough to maintain it. To a certain degree I agree with this. If you find yourself needing global access, you should take every effort to treat them as read-only, and only change them globally if it is absolutely necessary. Fortunately, some codes already have a read-only registry, where everything can be accessed easily, but still safe from accidental user changes.

5. Narrow down operating region quickly

point2D point_set[np]
point2D grid[ni][nj]
for k = 1, np
shortest_distance = a_very_large_number
closest_cell_location = (0,0)
for i = 1, ni
for j = 1, nj
this_dist = get_dist ( point_set[k], grid[i][j] )
if ( this_dist < shortest_distance )
shortest_distance = this_dist
closest_cell_location = (i,j)

For each point in the point set, it goes through the entire grid to locate the cloeset cell. The number of look-ups (i.e., distance comparison) is np×ni×nj. When each dimension is 1000, you are talking about 1 billion calculations, which is a lot. The code in question, as a whole, took two weeks to run.

Of course, there is a smarter way of doing the same thing. Since the x and y coordinates in the grid are sorted, even the simple to implement bisection search will improve things significantly:

point2D point_set[np]
print2D grid[ni][nj]
for k = 1, np
shortest_distance = a_verg_large_number
i0 = 1 i1 = ni
j0 = 1 j1 = nj
while ( (i1>i0) and (j1>j0) )
(iNE,jNE) = get_NE_quadrant_grid_center ( i0, i1, j0, j1 )
(iNW,jNW) = get_NW_quadrant_grid_center ( i0, i1, j0, j1 )
(iSE,jSE) = get_SE_quadrant_grid_center ( i0, i1, j0, j1 )
(iSW,jSW) = get_SW_quadrant_grid_center ( i0, i1, j0, j1 )
distNE = get_dist ( point_set[k], grid[iNE][jNE] )
distNW = get_dist ( point_set[k], grid[iNW][jNW] )
distSE = get_dist ( point_set[k], grid[iSE][jSE] )
distSW = get_dist ( point_set[k], grid[iSW][jSW] )
if distNE is minimum
i0 = i1 / 2
j0 = j1 / 2
if distNW is minimum
i1 = i1 / 2
j0 = j1 / 2
if distSE is minimum
i0 = i1 / 2
j1 = j1 / 2
if distSW is minimum
i1 = i1 / 2
j1 = j1 / 2

In this approach, for each point in the point set, four distance calculations are made, and the number of inner loops required is proportional to the logarithm (base 2) of the problem size, instead of its square. So for the same problem size, only a total of 1000 × 4 × 10 = 40000 distance calculations are needed, about 5 orders of magnitude lower than the naive approach. Requiring only about 30 minutes to program and test, this program would complete in less than a minute.

Even when we are living in a time where terabytes of output data is routine, you don’t need to be well versed in sophisticated data structures or expensive hardware to speed up your code. A simple algorithm for quickly narrowing down your region of interest will be enough to give you quick feedback.

6. Make your module unit testable

It is instructive to first develop your feature in a standalone module, test it in isolate until you know it is behaving in the way you expect. Only then you can talk about integrating your feature into the code base. Even though you might need to spend some time in building the standalone infrastructure to mimic the project code base in a minimal way, this approach will still give you a much faster feedback, letting you know whether you are on the right path early on. You can focus on making sure the feature is working first. So when the seg faults in the project code, you know the problem lies on the integration, and not on the correctness of your code behavior.

7. Take your scientific colleagues’ words with a huge grain of salt

The Little Guy talking about Little Things everyone talks about.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store