Using Git and GitHub for LaTeX writing
The problem
As an academic, I spend my life writing papers. Since these papers are mostly about math or applications of math, I use LaTeX extensively.
I use Git and GitHub to manage these writings: to obtain compiled outputs from them regularly, to store different versions (e.g. during peer review), to discuss the code and the papers with co-authors, and so on.
This document summarizes my current workflows for LaTeX automation and version control.
Isn’t version control meant for code, not LaTeX documents?
Yes, tools like SVN and Git were developed to version-control source code. But the LaTeX you write is source code too: it compiles to a full document instead of a program, but it has the same structure. In fact, TeX is a Turing-complete language.
Git plus LaTeX is a wonderful workflow. Besides the standard Git recommendations (write relevant commit messages, etc.), three guidelines usually come up online when people discuss LaTeX and Git:
- Use a .gitignore file to avoid committing LaTeX compilation artifacts that you do not want, including the .pdf file.
- Consider using line breaks in your document, putting each sentence on its own line instead of each paragraph. This will make your document “taller” and improve your git diff results and merge capabilities, since Git is line-based.
- Use branches as you would for code: to develop a new idea, alter a theorem, etc., and keep the most finished version of the paper on master.
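As a starting point for the first guideline, here is a minimal sketch of a helper that writes a .gitignore for a LaTeX project. The function name write_latex_gitignore is hypothetical, the list of extensions is a common subset rather than an exhaustive one, and it assumes the main file is paper.tex, so that only the compiled paper.pdf and diff.pdf are ignored and PDF figures stay tracked.

```shell
#!/bin/bash
# Hypothetical helper: write a starter .gitignore for a LaTeX project.
# Usage: write_latex_gitignore [directory]   (defaults to the current one)
# Note: this overwrites any existing .gitignore in that directory.
write_latex_gitignore() {
  local dir="${1:-.}"
  cat > "$dir/.gitignore" <<'EOF'
# LaTeX build artifacts
*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.synctex.gz
*.toc
# Compiled outputs (assumes the main file is paper.tex)
paper.pdf
diff.pdf
EOF
}
```

Running `git status` after calling it should show the artifacts disappearing from the list of untracked files.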
But some less obvious practices can also drastically simplify your life.
Journal template, arXiv export, etc. get their own branches.
I use branches for new features, potential ideas, and changes that I’m not ready to merge into the main paper yet. We also use them to pass pull requests around when writing with others: this is a very good way to collaborate.
More specific to academic writing, I also use special branches to keep publisher-specific modifications apart from the rest.
Indeed, The Fancy Journal on Scarabs Reproduction might request changes that are at the template level: change the documentclass, add template files, change “Figure X” to “Fig. X” everywhere, use one convention instead of another, etc. Each journal has its own guidelines and can be picky about these things.
Therefore, I keep these commits on a dedicated branch, the journalname branch. Then, if the journal is unhappy and rejects my work, I can just start over from master and voilà. If, on the other hand, they ask for revisions, I can still use a fast-forward merge.
arXiv is known to be a little picky when compiling the files you give it: it runs old versions of packages, of TeX Live, etc. Therefore, it deserves its own branch with exactly the same structure, for exactly the same reason.
These journal branches are not supposed to be merged back into master: instead, I bring master into all of them regularly (or when I need them), usually through a rebase.
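The rebase step can be sketched as a small helper. This is only an illustration: the function name refresh_branches and the branch names in the example are hypothetical, and it assumes the publisher branches contain nothing but template-level commits sitting on top of master.

```shell
#!/bin/bash
# Rebase each publisher branch onto the trunk so it picks up the latest content.
# Usage: refresh_branches <trunk> <branch>...
refresh_branches() {
  local trunk="$1" branch
  shift
  for branch in "$@"; do
    # "git rebase <upstream> <branch>" switches to <branch> and replays its
    # own commits on top of <upstream>. On a conflict, git stops mid-rebase
    # and this function returns so the conflict can be resolved by hand.
    git rebase "$trunk" "$branch" || return 1
  done
  # Go back to the trunk once every branch has been refreshed.
  git checkout --quiet "$trunk"
}

# Example (placeholder branch names):
# refresh_branches master arxiv journalname
```

Since a rebase rewrites the history of the publisher branch, a branch that was already pushed needs `git push --force-with-lease` afterwards.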
Semver tags are priceless.
I use tags a lot, and I follow a kind of semantic versioning, major.minor.patch:
- A patch increase represents an internal change that adds nothing new: maybe I rewrote a section, or changed a proof because it was wrong, or something like that. I do not always use these tags.
- A minor increase represents new content: I wrote a new theorem, a new proof, a new section, or I added a different example, a new property, a new remark, etc.
- A major increase is reserved for publication status: usually, v1.0 is the version that corresponds to the preprint, which I then branch out for arXiv and for submission to the journal, while v2.0 corresponds to the result after taking the first review into account. If there is no second round, this is also the last version (otherwise there will be a v3.0).
These semver tags can be leveraged to quickly refer to different versions of your work. Furthermore, they can also be used for automatic comparison between versions!
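To make the bump rules concrete, here is a minimal sketch of a helper that computes the next tag name. The function name next_version is hypothetical, and it assumes tags of the form vMAJOR.MINOR.PATCH; a shorter tag such as v1.0 is read as v1.0.0.

```shell
#!/bin/bash
# Compute the next tag from the current one and a bump level.
# Usage: next_version <current-tag> <major|minor|patch>
next_version() {
  local level="$2" major minor patch
  # Strip the leading "v" and split the rest on dots.
  IFS=. read -r major minor patch <<EOF
${1#v}
EOF
  # Components that are absent (e.g. "v1.0") default to 0.
  minor="${minor:-0}"
  patch="${patch:-0}"
  case "$level" in
    major) major=$((major + 1)); minor=0; patch=0 ;;  # publication status
    minor) minor=$((minor + 1)); patch=0 ;;           # new content
    patch) patch=$((patch + 1)) ;;                    # internal change
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  echo "v${major}.${minor}.${patch}"
}

next_version v1.2.3 minor   # prints v1.3.0
next_version v1.0 major     # prints v2.0.0
```

It can be combined with git directly, for instance `git tag -a "$(next_version "$(git describe --tags --abbrev=0)" minor)" -m "Add a new example"`.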
Exploiting tags with latexdiff & GitHub CI.
The previous semver rules can be leveraged to compile diffs, that is, versions of the paper that clearly show what changed between two versions. This compilation can be done on GitHub Actions itself. A script to do it might look like this:
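#!/bin/bash
# Extract paper.tex as it was at the most recent tag reachable from the previous commit.
git show "$(git describe --tags --abbrev=0 HEAD^)":paper.tex > tmp.tex
# Mark up the differences between the tagged version and the working copy.
latexdiff -t UNDERLINE --verbose --no-links --flatten tmp.tex paper.tex > diff.tex
latexmk -pdf -f -interaction=nonstopmode diff.tex # Produces the diff pdf.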
This simple script produces a diffed PDF of the paper that compares the currently checked-out version to the most recent tagged version reachable from the previous commit (whatever its level, not necessarily a major one). Of course, you could modify the first line to fetch the version of your choice.
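For instance, to compare against the last major version only, the tag lookup could be restricted with a glob. This is a sketch under an assumption that is not in the workflow above: that major tags are spelled vN.0.0 (with two-component tags such as v1.0, the pattern would have to be adapted). The function name previous_major_tag is hypothetical.

```shell
#!/bin/bash
# Print the most recent major tag (vN.0.0) reachable from a commit.
# Usage: previous_major_tag [commit]   (defaults to HEAD^)
previous_major_tag() {
  # --match restricts the tags that "git describe" considers to those
  # matching the glob, so minor and patch tags are skipped.
  git describe --tags --abbrev=0 --match 'v*.0.0' "${1:-HEAD^}"
}

# Example: extract the paper as it was at the last major version.
# git show "$(previous_major_tag)":paper.tex > tmp.tex
```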
Note that latexdiff can be a pain sometimes, and you might need other exclusion rules depending on which LaTeX features you are using. Finding the faulty part of your code can be complicated (git bisect could help), which is why I recommend running this diff on each commit so that you catch problems early. You can do this through GitHub CI.
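If a breakage does slip through, the git bisect idea can be sketched as follows. The function name first_bad_commit is hypothetical, and the check command is whatever exits with a non-zero status when the diff no longer builds; it must be run from the top level of the repository.

```shell
#!/bin/bash
# Print the first commit, after a known-good reference, for which a check fails.
# Usage: first_bad_commit <good-ref> <check-command> [args...]
first_bad_commit() {
  local good="$1"
  shift
  # Mark the current commit as bad and <good-ref> as good.
  git bisect start HEAD "$good" >/dev/null || return 1
  # The remaining arguments are the check: exit status 0 means "good",
  # 1 to 127 (except 125, which means "skip") means "bad".
  git bisect run "$@" >/dev/null
  # After a completed run, refs/bisect/bad points at the first bad commit.
  git rev-parse refs/bisect/bad
  # Return to the commit that was checked out before bisecting.
  git bisect reset >/dev/null 2>&1
}

# Example (make_diff.sh is a placeholder for your own build script):
# first_bad_commit v1.0 ./make_diff.sh
```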
Automate latexdiff runs with GitHub CI
These days, I use the following GitHub workflow:
name: Build LaTeX document & latexdiff to previous tagged version.
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout new version
        uses: actions/checkout@v4
        with:
          path: new
          fetch-depth: 0
      - name: Set last tag value
        run: |
          echo "LAST_TAG_NAME=$(cd new && git describe --tags --abbrev=0 HEAD^)" >> "$GITHUB_OUTPUT"
        id: tag_val_stp
      - name: Checkout old version
        uses: actions/checkout@v4
        with:
          path: old
          ref: ${{ steps.tag_val_stp.outputs.LAST_TAG_NAME }}
      - name: Compile LaTeX document
        uses: xu-cheng/latex-action@v2
        with:
          root_file: paper.tex
          working_directory: new/
      - name: Install latexdiff
        run: sudo apt-get install -y latexdiff
      - name: Run latexdiff
        run: latexdiff -t UNDERLINE --verbose --no-links --flatten old/paper.tex new/paper.tex > new/diff.tex
      - name: Compile LaTeX difference document
        uses: xu-cheng/latex-action@v2
        with:
          root_file: diff.tex
          working_directory: new/
      - name: Check pdf files
        run: |
          file new/diff.pdf | grep -q ' PDF '
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          path: |
            new/diff.pdf
            new/paper.pdf
This workflow uploads these files as artifacts, and also ensures that the latexdiff document always compiles, even if I do not run it locally often enough. Another option is to publish the produced files as a website, so that you have fixed links to share the latest version (even with a private repo). This can be useful if people on your team want to look at your manuscript without contributing directly to it, e.g. your PhD advisors for your PhD manuscript.
Version the code alongside the paper
Usually my papers contain some numerical applications that rely on code: it might be R, Python, Matlab or, these days, mostly Julia.
I split this code into two parts. One part, which I publish somewhere else, is usually a kind of package that is supposed to be usable by others. But there is always another part that generates data and graphs directly for the paper. This part should not go inside the package; it should be versioned alongside the LaTeX document.
I usually use a code/ folder that produces things (images, tables, etc.) inside an assets/ folder.
This ensures full reproducibility of the research. I do not generally make the source of the paper available before full publication, but when I do, these files sit alongside it and can be used to reproduce my results directly.
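One way to keep the code/ to assets/ step reproducible is a small driver that reruns every generation script. This is a sketch under an assumption that is not part of the setup described above: each generation step is wrapped in a shell script under code/ (which may in turn call Julia, R, or Python). The function name rebuild_assets is hypothetical.

```shell
#!/bin/bash
# Rerun every generation script in code/ so that assets/ is rebuilt from scratch.
# Usage: rebuild_assets [repository-root]   (defaults to the current directory)
rebuild_assets() {
  local root="${1:-.}" script
  mkdir -p "$root/assets"
  for script in "$root"/code/*.sh; do
    # If the glob matched nothing, it is left as-is: skip it.
    [ -e "$script" ] || continue
    # Run from the repository root so that scripts can write to assets/
    # with relative paths; stop at the first script that fails.
    (cd "$root" && sh "code/$(basename "$script")") || return 1
  done
}

# Example: rebuild_assets && latexmk -pdf paper.tex
```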
Add a CITATION.bib file to the repo and present it in the README.
The README of the repo might not be as useful as it is for a package project, since not many people will see it: we are talking about private repos.
However, a CITATION.bib file that contains the reference to this paper can be added to the repo after it is made public.
Someday, I might write a script that goes through these paper repositories of mine and gathers these citations to populate my paper list… But not today. Today, I celebrate.