The importance of free software to science

June 4, 2025

This article was contributed by Lee Phillips

Free software plays a critical role in science, both in research and in disseminating it. Aspects of software freedom are directly relevant to simulation, analysis, document preparation and preservation, security, reproducibility, and usability. Free software brings practical and specific advantages to science, beyond just its ideological roots, while proprietary software comes with equally specific risks. As a practicing scientist, I would like to help others, scientists or not, see the benefits of free software in science.

Although there is an implicit philosophical stance here—that reproducibility and openness in science are desirable, for instance—it is simply a fact that a working scientist will use the best tools for the job, even if those might not strictly conform to the laudable goals of the free-software movement. It turns out that free software, by virtue of its freedom, is often the best tool for the job.

Reproducing results

Scientific progress depends, at its core, on reproducibility. Traditionally, this referred to the results of experiments: it should be possible to attempt their replication by following the procedures described in papers. In the case of a failure to replicate the results, there should be enough information in the paper to make that finding meaningful.

The use of computers in science adds some extra dimensions to this concept. If the conclusions depend on some complex data massaging using a computer program, another researcher should be able to run the same program on the original or new data. Simulations should be reproducible by running the identical simulation code. In both cases this implies access to, and the right to distribute, the relevant source code. A mere description of the algorithms used, or a mention of the name of a commercial software product, is not good enough to satisfy the demands of a meaningful attempt at replication.

The source code alone is sometimes not enough. Since the details of the results of a calculation can depend on the compiler, the entire chain from source to machine code needs to be free to ensure reproducibility. This condition is automatically met for languages like Julia, Python, and R, whose interpreters and compilers are free software. For C, C++, and Fortran, the other currently popular languages for simulation and analysis, this is only sometimes the case. To get the best performance from Fortran simulations, for example, scientists often use commercial compilers provided by chip manufacturers.
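One reason the toolchain matters is that floating-point addition is not associative, so a compiler that regroups a sum (for vectorization, say) can change the last digits of a result. The sketch below imitates that regrouping by hand in Python; the function names are illustrative only, and it is not meant to model any particular compiler's behavior.

```python
import math


def sum_left_to_right(values):
    """Add the values in order, as a naive loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Add the values by splitting the list in half recursively.

    This is one way a sum might be regrouped; the grouping differs
    from the left-to-right loop, so rounding happens at different points.
    """
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


values = [0.1, 0.2, 0.3]
in_order = sum_left_to_right(values)   # (0.1 + 0.2) + 0.3
regrouped = sum_pairwise(values)       # 0.1 + (0.2 + 0.3)

print(in_order == regrouped)              # False: the two groupings round differently
print(math.isclose(in_order, regrouped))  # True: they differ only in the last bits
```

The discrepancy is tiny here, but in a long-running simulation such differences can accumulate or be amplified, which is why a bit-for-bit replication needs the same compiler and settings as well as the same source.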

Document preparation and preservation

The forward march of science is recorded in papers which are collected on preprint servers (such as arXiv), on the home pages of scientists, and published in journals. It's obviously bad for science if future generations can't read these papers, or if a researcher can no longer open a manuscript after upgrading their word-processing software. Fortunately, the future readability of published papers is enabled by the adoption, by journals and preprint servers, of PDF as the universal standard format for the distribution of published work. This has been the case even with journals that request Microsoft Word files for manuscript submission.

PDF files are based on an open, versioned standard and will be readable into the foreseeable future with all of the formatting details preserved. This is essential in science, where communication is not merely through words but depends on figures, captions, typography, tables, and equations. Outside the world of scientific papers, HTML is by far the dominant markup language used for online communication. It has advantages over PDF in that simple documents take less bandwidth, HTML is more easily machine-readable and human-editable, and by default text flows to fit the reader's viewport. But this last advantage is an example of why HTML is not ideal for scientific communication: its flexibility means that documents can appear differently on different devices.

The final rendering of a web document is the result of interpretation of HTML and CSS by the browser. The display of mathematics typically depends on evolving JavaScript libraries, as well, so the author does not know whether the reader is seeing what was intended. The "P" in PDF stands for "portable": every reader sees the same thing, on every device, using the same fonts, which should be embedded into the file. The archival demands of the scientific record, combined with the typographic complexity often inherent to research papers, require a permanent and portable electronic format that sets their appearance in stone.

To aid collaboration and to ensure that their work is widely readable now and in the future, scientists should distribute their articles in the form of PDF files, ideally alongside text-based source files. In mathematics and computer science, and to some extent in physics, LaTeX is the norm, so researchers in these fields will have the editable versions of their papers available as a matter of course. Biology and medicine have not embraced the culture of LaTeX; their journals encourage Word files (but often accept RTF output). Biologists working in Word should create copies of their drafts in one of Word's text-based formats, such as .docx or .odt; though these files may not be openable by future versions of Word, their contents will remain readable. Preservation of text-based, editable source files is essential for scientists, who often revise and repurpose their work, sometimes years after its initial creation.
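To make the claim about readability concrete: a .docx file is a ZIP archive whose main text lives in an XML file, so its contents can be recovered with nothing but a standard library. The following is a minimal sketch, assuming the usual layout (the body in `word/document.xml`, text in `w:t` elements inside `w:p` paragraphs); the helper `make_minimal_docx` is a hypothetical stand-in that builds a stripped-down archive for demonstration, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the pieces.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_minimal_docx(lines):
    """Hypothetical helper: build an in-memory archive with only the body XML."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


sample = make_minimal_docx(["Abstract", "We measure x < y."])
print(docx_paragraphs(sample))
```

Formatting, equations, and figures need more work than this, but the words themselves are never locked away, which is the point of keeping drafts in such a format.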

Licensing problems

Commercial software practically always comes with some form of restrictive license. In contrast with free-software licenses, commercial ones typically interfere with the use of programs, which often throws a wrench into the daily work of scientists. The consequences can be severe; software that comes with a per-seat or similar type of license should be avoided unless there is no alternative.

One sad but common situation is that of a graduate student who becomes accustomed to a piece of expensive commercial analytical software (such as a symbolic-mathematics program), enjoying it either through a generous student discount or because it's paid for by the department. Then the freshly-minted PhD discovers the real price of the software, and can't afford it on their postdoc salary. They have to learn new ways of doing things, and have probably lost access to their past work, which is locked up in proprietary binary files.

A few months ago, an Elsevier engineering journal retracted two papers because their authors had used a commercial fluid-dynamics program without purchasing a license for it. The company behind the program regularly scans publications looking for mentions of its product in order to extract license fees from authors. In these cases, the papers had already been cited, so their retraction is disruptive to scholarship. Cases such as these are particularly clear examples of the potential damage to science (and to the careers of scientists) that can be caused by using commercial software.

In addition, certain commercial software products with per-seat licensing "call home" so that the companies that sell them can keep track of how many copies of their programs are in use. The security implications of this should be obvious to anyone, yet government organizations, while adhering minutely to security rituals with questionable efficacy, permit their installation. While working at a US Department of Defense (DoD) lab, I was an occasional witness to the semi-comical sight of someone running around knocking on office doors, trying to find out who was using (or had left running) a copy of the program that they desperately needed to use to meet some deadline—but were locked out of.

Software rot

Ideally scientists would only use free software, and would certainly avoid "black box" commercial software for the various reasons mentioned in this article. But there is another category that's less often spoken of: commercial software that provides access to its source code.

When I joined a new project at my DoD job, the engineer who I was supposed to work with was at a loss because a key software product had stopped working after he upgraded the operating system (OS) on his workstation. The operating system couldn't be downgraded and the company was no longer supporting the product. I got a thick binder from him with the manual and noticed a few floppy disks included. These contained the source code. Right at the top of the main program was a line that checked the version of the OS and exited if it was not within the range that the program was tested on. I figured we had nothing to lose, so I edited this line to accept the current OS version. The program ran fine and we were back in business.
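For readers who have not met this kind of guard, here is a rough illustration of what such a check can look like. Everything in it is hypothetical: the version numbers, the names, and the choice of Python are assumptions for the sake of the example and do not describe the actual program.

```python
# Hypothetical range of OS versions the vendor tested against.
TESTED_RANGE = ((4, 0), (4, 3))


def parse_version(text):
    """Turn a dotted version string such as "4.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def os_is_supported(os_version, tested_range=TESTED_RANGE):
    """Return True if os_version lies within the tested range (inclusive)."""
    low, high = tested_range
    return low <= parse_version(os_version) <= high


# Before the edit: a newer OS is rejected even if the program would run.
print(os_is_supported("4.1"))   # True
print(os_is_supported("5.0"))   # False

# The "fix" amounts to widening the accepted range by one line.
print(os_is_supported("5.0", tested_range=((4, 0), (5, 0))))   # True
```

With the source in hand, the change is a one-line edit; without it, the same guard would have been an immovable obstacle.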

The point of this anecdote is to illustrate the practical value of access to source code. Such proprietary but source-available software occupies an intermediate position between free software and the black boxes that should be strictly avoided. Although source-available software is more transparent, practical, and useful than a black box, it still fails to satisfy the reproducibility criterion: the scientist who uses it can't publish or distribute the source, so other scientists can't repeat the calculations.

Software recommendations

The following specific recommendations are for free software that's potentially of use to any scientist or engineer.

Scientists should, when practical, test their code using free compilers, and use these in preference to proprietary options when performance is acceptable. For the C family, GCC is the venerable standard, and produces performant code. A more recent but now equally capable option is Clang.

For Fortran, GFortran (which is a front-end for GCC) is a high-quality compiler and the standard free-software choice. Several more recently developed alternatives are built, as is Clang, on LLVM. To add to the potential for confusion, two of these are called "Flang". Those interested in investigating an LLVM option should follow the project called (usually) "LLVM Flang", which is written from scratch in C++, and was renamed to "Flang" once it became part of the LLVM project in 2020. Its GitHub page warns that it is "not ready yet for production usage", but this is probably the LLVM Fortran compiler of the future. Another option to keep an eye on is the LFortran compiler. Although still in alpha, this project (also built on LLVM) is unique in providing a read-eval-print loop (REPL) for Fortran.

For those scientists not tied to an existing project in a legacy language, Julia is likely the best choice for simulation and analysis. It's an interactive, LLVM-based, high-level expressive language that provides the speed of Fortran. Its interfaces to R, gnuplot and Python mean that those who've put time into crafting data-analysis routines in those languages can continue to use their work.

Although LaTeX is beloved for the quality of its typesetting, especially for mathematics, it is less universally admired for the inscrutability of its error messages, the difficulty of customizing its behavior using its arcane macro language, and its ability to occasionally make simple things diabolically difficult. Recently a competitor to LaTeX has arisen that approaches that venerable program in the quality of its typography (it uses some of the same critical algorithms) while being far easier to hack on: Typst. Like LaTeX, Typst is free software that uses text files for its source format, though Typst does also have a non-free-software web application. Typst is still in alpha, and so far only one journal accepts manuscripts using its markup language, but its early adopters are enthusiastic.

A superb solution for the preparation of documents of all types is Pandoc, a Haskell program that converts among a huge variety of file formats and markup languages. Pandoc allows the author to write everything in its version of Markdown and convert into LaTeX, PDF, HTML, various Word formats, and more. Raw LaTeX, HTML, and others can be added into the Markdown source, so the fact that Markdown has no markup for mathematics (for example) is not an obstacle. The ability to have one source and automatically create a PDF and a web page, or to produce a Word file for a publication that insists on it without having to touch a "what you see is what you get" (WYSIWYG) abomination, greatly simplifies the life of the writer/scientist. Pandoc can even output Typst files, so those who use it are ready for that revolution if it comes.
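As a small illustration of the one-source, many-outputs workflow, the sketch below assembles Pandoc command lines for several targets from a single Markdown file. The file names are placeholders, and the helper functions are my own invention; the only Pandoc options used are `--from`, `--output`, and `--standalone`. Nothing is executed unless `run=True` is passed and a `pandoc` executable is found on the path, and PDF output additionally assumes that a LaTeX engine is installed.

```python
import shutil
import subprocess


def pandoc_command(source, target, standalone=True):
    """Build the argument list for converting a Markdown source to target.

    Pandoc infers the output format from the target's file extension.
    """
    cmd = ["pandoc", source, "--from", "markdown", "--output", target]
    if standalone:
        # Produce a complete document rather than a fragment.
        cmd.append("--standalone")
    return cmd


def build_all(source, targets, run=False):
    """Return the commands for every target, optionally running them."""
    commands = [pandoc_command(source, target) for target in targets]
    if run and shutil.which("pandoc"):
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


# Placeholder file names; one source, three outputs.
for cmd in build_all("paper.md", ["paper.pdf", "paper.html", "paper.docx"]):
    print(" ".join(cmd))
```

Keeping the commands in a script like this (or in a Makefile) means that the PDF, the web page, and the Word file can never drift out of step with the source.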

Conclusion

The goals of the free-software movement include ensuring the ability of all users of software to form a community enriched and liberated by the right to study, modify, and redistribute code. The specific needs of the scientific community bring the benefits of free software into clear focus, and those benefits are critical to the health and continued progress of science.

The free-software movement has an echo in the "open-access movement", which is centered around scientific publication and began in the early 1990s. It has its origins in the desire of scientists to break free of the stranglehold of the commercial scientific publishers. Traditionally, those publishers have interfered with the free exchange of ideas, while extracting reviewer labor without compensation and attaching exorbitant fees to the access of scientific knowledge. Working scientists are aware of the movement, and most support its aims of providing free access to papers while preserving the curation and quality control inherited from traditional publishing. It is important to also continue to nourish awareness of the crucial role that free software plays throughout the scientific world.




