First, let me paint you a few scenarios. You have images that you want to use in multiple places, like presentations, papers, and maybe even that PhD thesis you are planning on writing in a few years. However, the image format is going to vary from document to document; some documents, like presentations, call for much larger fonts, whereas your PhD thesis is going to have a clipped image size compared to any papers you'll publish. You want to be able to quickly reconfigure a set of images to have the same look and feel. You also want to be able to see easily, several years from now, what processing you applied to an image after you've forgotten everything you did.
Another scenario: a person writing a book chapter has seen one of your images in a paper of yours, and she asks you to produce a variation of it for her chapter.
Third scenario: one of your colleagues discovers a bug in an analysis script. You need to quickly track down all data that uses this script, and see if it affects any of your published data.
Fourth scenario: you're trying to remember exactly how you analyzed a set of data, but you performed the experiment several months back and can't remember the exact method you used to produce a figure in one of your presentations.
Fifth scenario: you're modifying a script that creates a figure, and you're concerned about comparing the different possibilities and still getting the final version correct. How do you track all the changes?
These scenarios illustrate three important ideas when publishing data: reproducibility, traceability, and configurability. I am going to share my method of (mostly) solving these problems using LaTeX and the make utility.
Make is a text-based utility that is normally used for dependency tracking in large programs. Back in the day, when compiling took a lot more time than it does now, it was important that code was only compiled when it was necessary. When a program called a library, it wasn't necessary to re-compile the library every time the program was changed. The make program is used to track these dependencies, but it is general enough to track many other kinds of dependencies, such as the images included in a LaTeX file.
Let me give you an example. Say you have a LaTeX file that includes two images: A.jpeg and B.jpeg. Both are created by eponymous scripts A.exe and B.exe. A.exe, though, relies on a lot of really complicated processing performed by complexProcess.exe. You also want to have the same look and feel for both images.
This is quite simple to accomplish if you are familiar with the make utility. You can create a Makefile that looks like the following, which tracks the dependencies of your data and processing:
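
# pdf and viewPdf are commands to run, not files to create
.PHONY: pdf viewPdf

A.jpeg: A.exe tempFileFromProcessing
	./A.exe

# cache the result of the expensive processing step
tempFileFromProcessing: complexProcess.exe experimentalData
	./complexProcess.exe

B.jpeg: B.exe
	./B.exe

viewPdf: pdf
	acrobat paper.pdf

# listing the images here lets "make pdf" regenerate them when needed
pdf: paper.tex paper.bib A.jpeg B.jpeg
	latex paper
	bibtex paper
	latex paper
	latex paper
	dvipdf paper.dvi

This looks complicated, but it is quite simple. Each entry in the Makefile has the structure

makeItem: dependencyOne dependencyTwo
	commandOne
	commandTwo

Note that each command line must begin with a tab character, not spaces.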
To create makeItem, make checks that the dependencies are met. If makeItem and the dependencies are files, then make checks if the dependency files were last modified after the makeItem file, and will only run the commands if this is the case. In the Makefile example, A.jpeg depends on A.exe and tempFileFromProcessing. If either A.exe or tempFileFromProcessing has been modified after A.jpeg, then make will run ./A.exe (which should create A.jpeg). Similarly, tempFileFromProcessing depends on the processing script, complexProcess.exe, and experimentalData. If either complexProcess.exe or experimentalData has changed after the temp file, the processing will be re-run. This allows you to cache or save complicated processing, but still gives traceability into what processing occurs.
Making the file B.jpeg is simpler to understand, as it just depends on B.exe. If B.exe has changed after B.jpeg, make will run ./B.exe, which should create B.jpeg.
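To make the timestamp rule concrete, here is a rough Python sketch of the check described above. It is only an approximation of what make does: the function name `needs_rebuild` is made up for illustration, it assumes every dependency already exists on disk, and it does not recurse into the dependencies' own dependencies the way make does.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    # A target that does not exist yet always has to be built.
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # Rebuild when any dependency was modified more recently than the target.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "A.exe")
        image = os.path.join(workdir, "A.jpeg")
        for path in (script, image):
            open(path, "w").close()
        # Set explicit timestamps so the demo does not depend on clock resolution.
        os.utime(image, (1000, 1000))
        os.utime(script, (2000, 2000))  # the script was edited after the image was made
        print(needs_rebuild(image, [script]))  # True
```

Because the script is newer than the image, the check reports that A.jpeg is stale, which is the situation in which make would run ./A.exe.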
The same thing can be done with tables or experimentally derived values by having LaTeX include another .tex file that is generated by a script.
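As an illustration of the table case, here is a minimal sketch of a script that writes a table to a .tex file. The function names, the file name results.tex, and the values in the demo are all placeholders rather than anything from a real experiment, and the sketch does not escape characters that are special in LaTeX. The paper would pull the file in with \input{results.tex}, and the Makefile would list results.tex as a target in the same way as A.jpeg.

```python
def format_tex_table(rows, header):
    """Build a LaTeX tabular environment from a header and rows of values."""
    column_spec = "l" * len(header)  # one left-aligned column per header cell
    lines = ["\\begin{tabular}{" + column_spec + "}"]
    lines.append(" & ".join(header) + " \\\\ \\hline")
    for row in rows:
        lines.append(" & ".join(str(cell) for cell in row) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines) + "\n"


def write_tex_table(path, rows, header):
    """Write the table to `path` so the LaTeX document can \\input it."""
    with open(path, "w") as handle:
        handle.write(format_tex_table(rows, header))


if __name__ == "__main__":
    # Placeholder values, only to show the shape of the generated file.
    print(format_tex_table([["run1", 1.5], ["run2", 2.5]], ["Run", "Value"]))
```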
Getting LaTeX and BibTeX to generate a file can be a bit complicated, because the commands have to be run several times in a particular order. Using make removes this problem, as it runs the same sequence every time. To make the PDF, you'd type "make pdf", which causes make to run latex, then bibtex, then latex twice more, and finally dvipdf to convert the result to a PDF. You could instead type "make viewPdf", which builds the PDF and then opens it in Adobe Acrobat.
If you need to see which images depend on a specific script, you can have the script print out caller information (or use a logging utility) to check what data is passing through any buggy scripts. Add the logging to the script, re-make the PDF, and the output will list every call that passed data through the contaminated script.
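One possible way to do that logging is sketched below. The helper names `record_use` and `outputs_touched_by` and the tab-separated log format are assumptions made for this example, not part of any existing tool. Each script would call `record_use` when it runs, and the log could then be searched for a suspect script.

```python
def record_use(log_path, script, inputs, output):
    """Append one tab-separated line: script name, output file, then input files."""
    with open(log_path, "a") as log:
        log.write("\t".join([script, output] + list(inputs)) + "\n")


def outputs_touched_by(log_path, script):
    """Return the sorted, de-duplicated outputs that `script` has produced."""
    touched = set()
    with open(log_path) as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            # fields[0] is the script name, fields[1] is the output it wrote.
            if len(fields) >= 2 and fields[0] == script:
                touched.add(fields[1])
    return sorted(touched)
```

For example, if complexProcess.exe logged each run this way, calling `outputs_touched_by` with its name would list the intermediate files it wrote, and those can then be followed through the Makefile to the images that depend on them.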
Okay, so this fixes two problems, namely reproducibility and traceability. It also makes it easy to change the file and see the results. You can change which experimental run you use for experimentalData, for example, to see how your A.jpeg varies based on which run you're presenting. But what about configurability? How do you make the plots look the same for each document, but use the same plotting scripts across documents?
You can accomplish that by using a convention for look-and-feel in the figure generation files. For example, you can pass a file to all of your scripts that contains information like the font, font size, and figure size to create. This allows easy configurability.
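Here is a minimal sketch of such a convention, assuming the look-and-feel is stored as a JSON file that every plotting script reads at startup. The key names, the default values, and the function name `load_style` are all invented for this example; the point is only that each document supplies its own style file while the plotting scripts stay the same.

```python
import json

# Hypothetical defaults; a style file only needs to list what it changes.
DEFAULT_STYLE = {
    "font": "serif",
    "font_size": 10,
    "figure_width_in": 3.5,
    "figure_height_in": 2.5,
}


def load_style(path):
    """Read a JSON style file and fill in any missing keys with defaults."""
    with open(path) as handle:
        overrides = json.load(handle)
    # Reject misspelled keys early instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULT_STYLE)
    if unknown:
        raise KeyError("unknown style keys: " + ", ".join(sorted(unknown)))
    style = dict(DEFAULT_STYLE)
    style.update(overrides)
    return style
```

A presentation might ship a style file containing only a larger font size, while a thesis overrides the figure dimensions. If the style file is also listed as a dependency of each image in the Makefile, editing it causes make to regenerate every figure with the new look.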
So there you have it - the biggest strengths of LaTeX compared to programs like MSWord. You can create documents that use the same plotting scripts, but with different look-and-feels, for publications, presentations, posters and theses. It is easy to trace the source of plots when you need to come back to that publication in four years but you forgot the exact processing. You get caching of complex processing. Finally, it is easy to track down what published data has images run through contaminated scripts. The fact that LaTeX is compiled gives you superior traceability, configurability, and reproducibility of your data compared to WYSIWYG editors like MSWord.