Introduction

Overview

Teaching: 10 min
Exercises: 0 min

Questions

Why use version control?

Objectives

Understand the benefits of an automated version control system.

Understand the difference between Git and GitHub.

What is a version control system?

A version control system (VCS) is a piece of software which allows you to record and preserve the history of changes made to directories and files. If you mess things up, you can retrieve an earlier version of your project.

Why use a version control system?

[Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com](http://www.phdcomics.com)

The comic above illustrates some of the pitfalls of working without version control. Some of the benefits are given below:

Storing versions (properly)

Saving files after you have made changes should be an automatic habit. However, if you want to have different versions of your code, you will need to save the new version somewhere else or with a different name.

Do you just save the file(s) you changed, or all the files in the project?
How do you name these different versions? It is very easy to lose track of what is what.
How do you know what is different between each version?

Without a VCS you will probably end up with lots of nearly-identical (but critically different) copies of the same file, which is confusing and wastes hard drive space. Your project will probably start to look like this:

A VCS treats your files as one project, so you only have one current version on your disk (the working copy) - all the other variants and previous versions are saved in the VCS repository. A VCS starts with a base version of your project and only saves the changes you make along the way, so it is much more space efficient too.

Add changes sequentially

Save different versions

Merge different versions

Restoring previous versions

The ability to restore previous versions of a file (or all the files in your project) greatly reduces the scope for screw ups. If you make changes which you later want to abandon (e.g. the wording of your conclusion section was better before you started making changes, your code changes end up breaking things which previously worked and you can’t figure out why etc), you can just undo them by restoring a previous version.

Understanding what happened

Each time you save a new version of your project, the VCS asks you to give a description of why you made the changes. This helps identify which version is which.

Backup

For distributed version control like Git, each person working on the project has a complete copy of the project’s history (i.e. the repository) on their hard drive. This acts as a backup for the server hosting the remote repository.

Collaboration

Without VCS, you are probably using a shared drive and taking turns to edit files, or emailing files back and forth. This makes it really easy to overwrite or abandon someone else’s changes because you have to manually incorporate the other person’s changes into your version and vice versa.

With VCS, everyone is able to work on any file at any time without affecting anyone else. The VCS will then help you merge all the changes into a common version. It is also always clear where the most recent version is kept (in the repository).

Example scenario

Think about the following situation:

You are working on a handful of MATLAB files. You make a few changes, and then you want to try something you’re not quite confident about yet, so you save a copy in another folder just in case.

Then you want to try out the program with more data on a bigger machine, and you make a few changes there to get it working properly. Then you try out something else in the copy on your laptop.

Now you have three or four copies, all slightly different, and you have some results generated from all of them, and you include some of it in a paper.

Then someone asks for the same results based on a new data file. You have to go off and remind yourself which version you used, find out whether you still have it at all or whether you’ve changed it again since, check whether it really has the vital changes you thought you’d included but that might have been only on that other machine, and so on.

You should easily be able to see the benefits of VCS in the situation above.

What files can I track using version control?

VCS is typically used for software source code, but it can be used for any kind of text file:

Configuration files
Parameter sets
Data files
User documentation, manuals, and journal papers, whether they be plain-text, LaTeX, XML, md etc
Have a look at some of the projects on GitHub

Why should I avoid tracking binary files with version control?

It is possible to add binary files to a Git repository, but this is usually a bad idea:

diffs between versions become meaningless
binary files are often large, and thus slow down your repository
changes to binary files often require a whole new copy to be saved, so your repository can quickly grow in size

Strategies for dealing with large binary files are discussed here.

Git vs GitHub

For this session, we’ll be using Git, a popular distributed version control system and GitHub, a web-based service providing remote repositories. Distributed means that each user has a complete copy of the repository on their computer and can commit changes offline. If you have used a centralized version control system before e.g. Subversion, this will be one of the major differences to how you are used to working. See here for a more detailed comparison of Git and Subversion.

Key Points

Git is a version control tool; one of many.

GitHub is a repository hosting service; one of many.

Use version control to store versions neatly, restore previous versions, understand what happened (and why), and always know which is the current version.

Tracking changes with a local repository

Overview

Teaching: 35 min
Exercises: 0 min

Questions

How do I get started with Git?

Where does Git store information?

Objectives

Know how to set up a new Git repository.

Understand how to start tracking files.

Be able to commit changes to your repository.

Version control is centred round the notion of a repository which holds your directories and files. We’ll start by looking at a local repository. The local repository is set up in a directory in your local filesystem (local machine). For this we will use the command line interface.

Why use the command line?

There are lots of graphical user interfaces (GUIs) for using Git: both stand-alone and integrated into IDEs (e.g. MATLAB, Rstudio). We are deliberately not using a GUI for this course because:

you will have a better understanding of how the Git commands work (some functionality is often missing and/or unclear in GUIs)

you will be able to use Git on any computer (e.g. remotely accessing HPC systems, which generally only have Linux command line access)

you will be able to use any GUI, rather than just the one you have learned

Setting up Git

Git is already installed on the training machines, whether you’re using Windows or Linux. Instructions for setting up Git on your own machine are given under setup.

Tell Git who we are

As part of the information about changes made to files Git records who made those changes. In teamwork this information is often crucial (do you want to know who rewrote your ‘Conclusions’ section?). So, we need to tell Git about who we are (note that you need to enclose your name in quote marks):

$ git config --global user.name "Your Name" 			# Put quote marks around your name
$ git config --global user.email yourname@yourplace.org

Set a default editor

When working with Git we will often need to provide some short but useful information. In order to enter this information we need an editor. We’ll now tell Git which editor we want to be the default one (i.e. Git will always bring it up whenever it wants us to provide some information).

You can choose any editor available on your system. Choose one of the three options below

$ git config --global core.editor gedit				# Linux users only
$ git config --global core.editor notepad			# Windows users should use notepad
$ git config --global core.editor 'open -W -n'			# Mac users should use TextEdit

Git’s global configuration

We can now preview (and edit, if necessary) Git’s global configuration (such as our name and the default editor which we just set up). If we look in our home directory, we’ll see a .gitconfig file,

$ cat ~/.gitconfig

[user]
	name = Your Name
	email = yourname@yourplace.org
[core]
	editor = gedit

These global configuration settings will apply to any Git repository you work with under your user account on this computer, i.e. the --global commands above only need to be run once per account on each machine.
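Individual settings can be read back with git config, which is a quick way to check for typos. The sketch below is illustrative only: instead of --global it writes to a scratch file through the --file option (the scratch file and the variable name cfg are assumptions made for this example), so that it can be run without touching your real ~/.gitconfig.

```shell
# Scratch configuration file standing in for ~/.gitconfig (assumption for this sketch)
cfg=$(mktemp)

# Write the same settings as above, but into the scratch file
git config --file "$cfg" user.name "Your Name"
git config --file "$cfg" user.email yourname@yourplace.org
git config --file "$cfg" core.editor gedit

# Read a single setting back
git config --file "$cfg" user.name      # prints: Your Name

# List every setting stored in the file, one key=value pair per line
git config --file "$cfg" --list
```

On your own machine, git config --global --list shows the equivalent list for your real global configuration.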

Create a new repository with Git

We will be working with a simple example in this tutorial. It will be a paper that we will first start writing as a single author and then work on it further with one of our colleagues.

First, let’s create a directory within your home directory:

$ cd								# Switch to your home directory.
$ pwd								# Print working directory (output should be /home/<username>)
$ mkdir paper
$ cd paper

Now, we need to set this directory up to be a Git repository (or “initialise the repository”):

$ git init

Initialized empty Git repository in /home/user/paper/.git/

The directory “paper” is now our working directory.

If we look in this directory, we’ll find a .git directory:

$ ls .git

branches  config  description  HEAD  hooks  info  objects refs

The .git directory contains Git’s configuration files along with the full history of the repository. Be careful not to accidentally delete this directory!
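If you are ever unsure whether a directory is already a Git repository, Git can be asked directly. This is a minimal sketch that uses a throwaway directory (created with mktemp, an assumption made so the example does not interfere with ~/paper) rather than the lesson's paper directory.

```shell
# Work in a throwaway directory rather than ~/paper
repo=$(mktemp -d)
cd "$repo"

# Before 'git init' this is an ordinary directory, so the query fails
git rev-parse --is-inside-work-tree 2>/dev/null || echo "not a repository yet"

git init -q

# Afterwards Git recognises the directory and reports where its data lives
git rev-parse --is-inside-work-tree     # prints: true
git rev-parse --git-dir                 # prints: .git
```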

Tracking files with a git repository

Now, we’ll create a file. Let’s say we’re going to write a journal paper, so we will start by adding the author names and a title, then save the file.

$ gedit paper.md				# Windows and Mac users see below for text editors available on your system
# Add author names and paper title

Text editors on your OS

This course is written for Linux, where gedit should be available. If you are using Windows, use notepad instead:
notepad paper.md
Similarly, if you are using macOS, use TextEdit:
open -e paper.md

Accessing files from the command line

In this lesson we create and modify text files using a command line interface (e.g. terminal, Git Bash etc), mainly for convenience. These are normal files which are also accessible from the file browser (e.g. Windows explorer), and by other programs.

git status allows us to find out about the current status of files in the repository. So we can run,

$ git status

On branch master

Initial commit

Untracked files:
(use "git add <file>..." to include in what will be committed)

paper.md

nothing added to commit but untracked files present (use "git add" to track)

Information about what Git knows about the directory is displayed. We are on the master branch, which is the default branch in a Git repository (one way to think of branches is like parallel versions of the project - more on branches later).

For now, the important bit of information is that our file is listed as Untracked which means it is in our working directory but Git is not tracking it - that is, any changes made to this file will not be recorded by Git.
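For a more compact view of the same information, git status accepts a --short option, in which untracked files are marked with ??. The sketch below recreates the situation in a throwaway directory; the file contents are placeholders invented for the example.

```shell
# Throwaway repository so the example can be run anywhere
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create a file, but do not tell Git about it yet
printf '# Title\nAuthor names\n' > paper.md

# '??' marks a file that exists in the working directory but is untracked
git status --short      # prints: ?? paper.md
```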

Add files to a Git repository

To tell Git about the file, we will use the git add command:

$ git add paper.md
$ git status

On branch master

Initial commit

Changes to be committed:
(use "git rm --cached <file>..." to unstage)

	new file:   paper.md

Now our file is listed underneath where it says Changes to be committed.

git add is used for two purposes. Firstly, to tell Git that a given file should be tracked. Secondly, to put the file into the Git staging area which is also known as the index or the cache.

The staging area can be viewed as a “loading dock”, a place to hold files we have added, or changed, until we are ready to tell Git to record those changes in the repository.
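One consequence of the loading dock picture is that the staging area holds a snapshot of the file as it was when git add was run. The following sketch, run in a throwaway directory with placeholder file contents (both assumptions for illustration), shows how the short status changes as a file is staged and then edited again.

```shell
# Throwaway repository so the example can be run anywhere
repo=$(mktemp -d)
cd "$repo"
git init -q

printf '# Title\nAuthor names\n' > paper.md
git add paper.md

# First column 'A': the new file is staged and will be part of the next commit
git status --short              # prints: A  paper.md
git diff --cached --name-only   # prints: paper.md

# Editing the file again does not change the staged snapshot;
# second column 'M': there are further changes that are not yet staged
echo "More text" >> paper.md
git status --short              # prints: AM paper.md
```

Running git add paper.md again would update the staged snapshot to include the later edit.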

The staging area

Commit changes

In order to tell Git to record our change, our new file, into the repository, we need to commit it:

$ git commit
# Type a commit message: "Add title and authors"
# Save the commit message and close your text editor (gedit, notepad etc.)

Our default editor will now pop up. Why? Well, Git can automatically figure out which files are being committed, and by whom (thanks to the information we provided before) and even, what changes were made, but it cannot figure out why. So we need to provide this in a commit message.

If we save our commit message and exit the editor, Git will now commit our file.

[master (root-commit) 21cfbde] Add title and authors
 1 file changed, 2 insertions(+)
 create mode 100644 paper.md

This output shows the number of files changed and the number of lines inserted or deleted across all those files. Here, we have changed (by adding) 1 file and inserted 2 lines.

Now, if we look at its status,

$ git status

On branch master
nothing to commit, working directory clean

our file is now in the repository. The output from the git status command means that we have a clean directory i.e. no tracked but modified files.

Now we will work a bit further on our paper.md file by starting the introduction section.

$ gedit paper.md
# Write introduction section

If we now run,

$ git status

we see a “Changes not staged for commit” section, and our file is marked as modified:

On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

     modified:	 paper.md

no changes added to commit (use "git add" and/or "git commit -a")

This means that a file Git knows about has been modified by us but has not yet been committed. So we can add it to the staging area and then commit the changes:

$ git add paper.md
$ git commit							# "Write introduction"

Note that in this case we used git add to put paper.md into the staging area. Git already knows this file should be tracked but doesn’t know if we want to commit the changes we made to the file in the repository and hence we have to add the file to the staging area.

It can sometimes be quicker to provide our commit messages at the command-line by doing git commit -m "Write introduction section".
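As a rough sketch of the -m form, the commands below make two commits in a throwaway directory without opening an editor. The directory, the file contents and the local (per-repository) identity settings are assumptions chosen so that the example runs on its own.

```shell
# Throwaway repository with a local identity, so no global setup is needed
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

printf '# Title\nAuthor names\n' > paper.md
git add paper.md
git commit -q -m "Add title and authors"

echo "Introduction text" >> paper.md
git add paper.md
git commit -q -m "Write introduction section"

# The message given with -m becomes the subject of the commit
git log -1 --format=%s      # prints: Write introduction section

# Nothing is left to commit, so the short status prints nothing
git status --short
```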

In our introduction, we should cite a paper describing the main instrument used.

$ gedit paper.md 						# Cite instrument paper in introduction

Let’s also create a file refs.txt to hold our references:

$ gedit refs.txt					# Add the reference

Now we need to record our work in the repository so we need to make a commit. First we tell Git to track the references.

$ git add refs.txt					# Track the refs.txt file
$ git status							# Verify that refs.txt is now tracked

The file refs.txt is now tracked. We also have to add paper.md to the staging area. But there is a shortcut. We can use commit -a. This option means “commit all files that are tracked and that have been modified”.

$ git commit -am "Reference J Bloggs and add references file"	# Add and commit all tracked files

and Git will add, then commit, both files.

git commit -a stages every tracked file that has been modified and commits it in one step (which may be very useful if you have edited, say, 10 files and now want to commit all of them). Files that Git is not yet tracking are not included.
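A common surprise is that the -a option only picks up files Git already tracks. The sketch below shows this in a throwaway directory; notes.txt is a hypothetical extra file that is deliberately never passed to git add.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Title" > paper.md
git add paper.md
git commit -q -m "Add title and authors"

echo "Introduction" >> paper.md     # modify a file Git already tracks
echo "Scratch notes" > notes.txt    # hypothetical new file, never added

# -a stages and commits the modified tracked file only
git commit -q -am "Write introduction"

# The untracked file was left out of the commit
git status --short                  # prints: ?? notes.txt
```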

The Git commit workflow

Key Points

git init initializes a new repository

git status shows the status of a repository

Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded)

git add puts files in the staging area

git commit saves the staged content as a new commit in the local repository

Always write a log message when committing changes

Looking at history and differences

Overview

Teaching: 30 min
Exercises: 15 (inc 10 for break) min

Questions

How can I see what I have changed?

How can I look at the history of my repository?

Objectives

Be able to view history of changes to a repository

Be able to view differences between commits

Understand how and when to use tags to label commits

Looking at differences

We should reference some previous work in the introduction section. Make the required changes, save both files but do not commit the changes yet. We can review the changes that we made using:

$ gedit paper.md		# Cite previous studies in introduction
$ gedit refs.txt		# Add the reference to the database
$ git diff			# View changes

This shows the difference between the latest copy in the repository and the unstaged changes we have made.

- means a line was deleted.
+ means a line was added.
Note that a line that has been edited is shown as a removal of the old line and an addition of the updated line.
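Because plain git diff reports unstaged changes, its output becomes empty once a change has been staged with git add; the staged change can then be viewed with git diff --staged. The following sketch shows the difference in a throwaway directory, using placeholder file contents and --name-only to keep the output short.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Introduction" > paper.md
git add paper.md
git commit -q -m "Write introduction"

# Make a change but do not stage it yet
echo "Previous studies" >> paper.md
git diff --name-only            # prints: paper.md

# Once staged, the change no longer shows up in plain 'git diff' ...
git add paper.md
git diff --name-only            # prints nothing

# ... but it is visible when comparing the staging area with the last commit
git diff --staged --name-only   # prints: paper.md
```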

Looking at differences between commits is one of the most common activities. The git diff command itself has a number of useful options.

There is also a range of GUI-based tools for looking at differences and editing files. For example:

Diffmerge (Free, cross-platform)
WinMerge (open source, Windows)
GitHub Compare view

Git can be configured to use graphical diff tools, and this functionality is accessed using git difftool in place of git diff. Configuring a visual diff tool is covered on the hints and tips page. The choice of GUI for viewing differences depends on the context in which you are working and your own preferences related to choosing tools and technologies.

Now commit the change we made by adding the second reference:

$ git add paper.md refs.txt
$ git commit			# "Cite previous work in introduction"

Looking at our history

To see the history of changes that we made to our repository (the most recent changes will be displayed at the top):

$ git log

commit 8bf67f3862828ec51b3fdad00c5805de934563aa
Author: Your Name <your.name@manchester.ac.uk>
Date:	Mon Jun 26 10:22:39 2017 +0100

    Cite PCASP paper


commit 4dd7f5c948fdc11814041927e2c419283f5fe84c
Author: Your Name <your.name@manchester.ac.uk>
Date:	Mon Jun 26 10:21:48 2017 +0100

    Write introduction

commit c38d2243df9ad41eec57678841d462af93a2d4a5
Author: Your Name <your.name@manchester.ac.uk>
Date:	Mon Jun 26 10:14:30 2017 +0100

    Add author and title

The output shows (on separate lines):

the commit identifier (also called revision number) which uniquely identifies the changes made in this commit
author
date
your commit message

Git automatically assigns an identifier (e.g. 4dd7f5) to each commit made to the repository — we refer to this as COMMITID in the code blocks below. In order to see the changes made between any earlier commit and our current version, we can use git diff followed by the commit identifier of the earlier commit:

$ git diff COMMITID		# View differences between current version and COMMITID

And, to see changes between two commits:

$ git diff OLDER_COMMITID NEWER_COMMITID
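As a minimal sketch of comparing two commits, the commands below build a two-commit history in a throwaway directory and look the identifiers up with git rev-parse instead of copying them from git log by hand. The variable names older and newer, and the file contents, are assumptions for the example.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Title" > paper.md
git add paper.md
git commit -q -m "Add title and authors"

echo "A reference" > refs.txt
git add refs.txt
git commit -q -m "Add references file"

# Short identifiers, as they would appear in 'git log --oneline'
older=$(git rev-parse --short HEAD~1)
newer=$(git rev-parse --short HEAD)
git log --oneline

# Full differences between the two commits, then just the files involved
git diff "$older" "$newer"
git diff --name-only "$older" "$newer"    # prints: refs.txt
```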

Using our commit identifiers we can set our working directory to contain the state of the repository as it was at any commit. So, let’s go back to the very first commit we made,

$ git log
$ git checkout INITIAL_COMMITID

We will get something like this:

Note: checking out '21cfbdec'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 21cfbde... Add title and authors

This strange concept of the ‘detached HEAD’ is covered in the next section … just bear with me for now!

If we look at paper.md we’ll see it’s our very first version. And if we look at our directory,

$ ls

paper.md

then we see that our refs.txt file is gone. But, rest easy, while it’s gone from our working directory, it’s still in our repository. We can jump back to the latest commit by doing:

$ git checkout master

And refs.txt will be there once more,

$ ls

paper.md refs.txt

So we can get any version of our files from any point in time. In other words, we can set up our working directory back to any stage it was when we made a commit.

The `HEAD` and `master` pointers

HEAD is a reference, or pointer, which points to the branch at the commit where you currently are. We said previously that master is the default branch. But master is actually a pointer - that points to the tip of the master branch (the sequence of commits that is created by default by Git). You may think of master as two things:

a pointer
the default branch.

Before we checked out one of the past commits, the HEAD pointer was pointing to master i.e. the most recent commit of the master branch. After checking out one of the past commits, HEAD was pointing to that commit i.e. not pointing to master any more. That is what Git means by a ‘detached HEAD’ state and advises us that if we want to make a commit now, we should create a new branch to retain these commits.

Checking out a previous commit - detached head

If we created a new commit without first creating a new branch, i.e. working from the ‘detached HEAD’ these commits would not overwrite any of our existing work, but they would not belong to any branch. In order to save this work, we would need to checkout a new branch. To discard any changes we make from the detached HEAD state, we can just checkout master again.
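Whether HEAD is currently attached to a branch can be checked with git symbolic-ref, which prints the branch name or fails when HEAD is detached. The sketch below is illustrative: it uses a throwaway directory and explicitly names the initial branch master so that it behaves the same whatever default branch name the local Git installation uses.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # call the initial branch 'master', as in this lesson
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Title" > paper.md
git add paper.md
git commit -q -m "Add title and authors"
echo "Introduction" >> paper.md
git commit -q -am "Write introduction"

# HEAD is attached to the master branch
git symbolic-ref --short HEAD             # prints: master

# Checking out a commit directly detaches HEAD from any branch
git checkout -q HEAD~1
git symbolic-ref -q --short HEAD || echo "HEAD is detached"

# Checking out the branch name attaches HEAD again
git checkout -q master
git symbolic-ref --short HEAD             # prints: master
```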

Visualising your own repository as a graph

If we use git log with a couple of options, we can display the history as a graph, and decorate those commits corresponding to Git references (e.g. HEAD, master):

$ git log --graph --decorate --oneline

* 6a48241 (HEAD, master) Cite previous work in introduction
* ed26351 Cite PCASP paper
* 7446b1d Write introduction
* 4f572d5 Add title and author

Notice how HEAD and master point to the same commit. Now checkout a previous commit again, and look at the graph again. This time we add the --all option so that the whole history is displayed, rather than just the commits up to the one we have checked out.

$ git checkout HEAD~				# This syntax refers to the commit before HEAD
$ git log --graph --decorate --oneline --all

* 6a48241 (master) Cite previous work in introduction
* ed26351 (HEAD) Cite PCASP paper
* 7446b1d Write introduction
* 4f572d5 Add title and author

Notice how HEAD no longer points to the same commit as master. Let’s return to the current version of the project by checking out master again.

$ git checkout master

Using tags as nicknames for commit identifiers

Commit identifiers are long and cryptic. Git allows us to create tags, which act as easy-to-remember nicknames for commit identifiers.

For example,

$ git tag PAPER_STUB

We can list tags by doing:

$ git tag

Let’s explain to the reader why this research is important:

$ gedit paper.md	# Give context for research
$ git add paper.md
$ git commit -m "Explain motivation for research" paper.md

We can checkout our previous version using our tag instead of a commit identifier.

$ git checkout PAPER_STUB

And return to the latest commit,

$ git checkout master
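A tag keeps pointing at the commit it was created on, even as new commits are added. The sketch below checks this in a throwaway directory (an assumption for illustration, with placeholder file contents) by resolving the tag and the previous commit to their identifiers.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Title" > paper.md
git add paper.md
git commit -q -m "Add title and authors"

# Tag the current commit, then carry on working
git tag PAPER_STUB
echo "Motivation" >> paper.md
git commit -q -am "Explain motivation for research"

# The tag still refers to the earlier commit, not the new one
git rev-parse --short PAPER_STUB
git rev-parse --short HEAD~1      # same identifier as the line above

# Tags are shown next to their commits when the log is decorated
git log --oneline --decorate
```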

Top tip: tag significant events

When do you tag? Well, whenever you might want to get back to the exact version you’ve been working on. For a paper, this might be a version that has been submitted to an internal review, or has been submitted to a conference. For code this might be when it’s been submitted to review, or has been released.

Where to create a Git repository?

Avoid creating a Git repository within another Git repository. Nesting repositories in this way causes the ‘outer’ repository to track the contents of the ‘inner’ repository - things will get confusing!

Exercise: “bio” Repository

Create a new Git repository on your computer called “bio”
Be sure not to create your new repo within the ‘paper’ repo (see above)
Write a three-line biography for yourself in a file called me.txt
Commit your changes
Modify one line, add a fourth line, then save the file
Display the differences between the updated file and the original

You may wish to use the faded example below as a guide

cd ..                # Navigate out of the paper directory
                     # Avoid creating a repo within a repo - confusion will arise!
mkdir ___            # Create a new directory called 'bio'
cd ___               # Navigate into the new directory
git ____             # Initialise a new repository
_____ me.txt         # Create a file and write your biography
git ___ me.txt       # Add your biography file to the staging area
git ______           # Commit your staged changes
_____ me.txt         # Edit your file
git ____ me.txt      # Display differences between your modified file and the last committed version

Solution

cd ..                # Navigate out of the paper directory
                     # Avoid creating a repo within a repo - confusion will arise!
mkdir bio            # Create a new directory
cd bio               # Navigate into the new directory
git init             # Initialise a new repository
gedit me.txt         # Create a file and write your biography
git add me.txt       # Add your biography file to the staging area
git commit           # Commit your staged changes
gedit me.txt         # Edit your file
git diff me.txt      # Display differences between your modified file and the last committed version

Key Points

git log shows the commit history

git diff displays differences between commits

git checkout recovers old versions of files

HEAD points to the commit you have checked out

master points to the tip of the master branch

Commit advice

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How, what, and when to commit?

What makes a good commit message?

Objectives

Understand what makes a good commit message

Know which types of files not to commit

Know when to commit changes

How to write a good commit message

Commit messages should explain why you have made your changes. They should mean something to others who may read them, including your future self six months from now. As such you should be able to understand why something happened months or years ago.

Well written commit messages make reviewing code much easier, and more enjoyable. They also make it easier to work with the history through commands like blame, revert, rebase, and log.

Here is an excellent summary of best-practice, following established conventions. It’s well worth a read but the key points are given below:

Separate the subject from body with a blank line
Limit the subject line to 50 characters
Capitalize the subject line
Do not end the subject line with a period
Use the imperative mood in the subject line
Wrap the body at 72 characters
Use the body to explain what and why vs. how
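When committing from the command line, one way to follow the subject/body convention is to pass -m twice: Git joins the messages as separate paragraphs, so the first becomes the subject and the second the body. This is a sketch in a throwaway directory; the message wording is invented for the example.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

printf '# Title\nAuthor names\n' > paper.md
git add paper.md

# First -m: short, capitalised, imperative subject without a full stop
# Second -m: body explaining why, separated from the subject by a blank line
git commit -q \
    -m "Add title and authors" \
    -m "Co-authors need to check names and affiliations before drafting."

git log -1 --format=%s    # prints: Add title and authors
git log -1 --format=%b    # prints: Co-authors need to check names and affiliations before drafting.
```

For longer bodies it is usually more comfortable to run git commit without -m and write the message in the editor.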

Commit anything that cannot be automatically recreated

Typically we use version control to save anything that we create manually e.g. source code, scripts, notes, plain-text documents, LaTeX documents. Anything that we create using a compiler or a tool, e.g. object files (.o, .a, .class), generated documents (.pdf, .dvi), binaries (exe files) or libraries (dll or jar files), we don’t save as we can recreate it from the source. Adopting this approach also means there’s no risk of the auto-generated files becoming out of sync with the manual ones.

We can automatically ignore such files using a .gitignore file. See hints and tips.
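As a minimal sketch of how a .gitignore file behaves, the example below ignores PDF and object files in a throwaway directory. The patterns and file names are assumptions chosen for illustration, and paper.pdf is just a stand-in for a generated document.

```shell
# Throwaway repository so the example can be run anywhere
repo=$(mktemp -d)
cd "$repo"
git init -q

# One pattern per line: ignore generated PDFs and object files
printf '*.pdf\n*.o\n' > .gitignore

echo "Source text" > paper.md
echo "Generated output" > paper.pdf   # stand-in for a generated file

# The ignored file is not offered for tracking
git status --short                    # prints: ?? .gitignore
                                      #         ?? paper.md

# Ask Git whether a particular path is ignored
git check-ignore paper.pdf            # prints: paper.pdf
```

The .gitignore file itself is normally committed, so that everyone working on the project ignores the same files.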

When to commit changes?

Commit frequently.
- There are no hard and fast rules, but good commits are atomic - they are the smallest change that remains meaningful.
- In the same way that it is wise to frequently save a document that you are working on, so too is it wise to save numerous revisions of your files. More frequent commits increase the granularity of your “undo” button.
- Small commits also help to avoid large merge conflicts.
Test before you commit
- Don’t commit changes until you’ve tested that your code works.
- Non-working code should be fixed before you commit.
Don’t commit unfinished work
- Break your code changes into small, but working chunks.
- If you need to temporarily save some work-in-progress (e.g. in order to work in another branch), use git stash – see hints and tips.
Commit related changes.
- Confine your commit to directly related changes. If you fix two separate bugs, you should have two separate commits.
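The git stash command mentioned above can be sketched as follows. This is only an outline of the basic save-and-restore cycle, run in a throwaway directory with placeholder file contents.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Introduction" > paper.md
git add paper.md
git commit -q -m "Write introduction"

# Unfinished work that we do not want to commit yet
echo "Half-finished sentence" >> paper.md

# Put the work-in-progress aside; the working directory is clean again
git stash -q
git status --short      # prints nothing

# Bring the work-in-progress back when we are ready to continue
git stash pop -q
git status --short      # prints:  M paper.md
```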

Key Points

Commit messages explain why changes were made, so make them clear and concise

Follow conventions to give a history that is both useful, and easy to read

Only commit files which can’t be automatically recreated

Branching

Overview

Teaching: 25 min
Exercises: 15 min

Questions

What is a branch?

How can I merge changes from another branch?

Objectives

Know what branches are and why you would use them

Understand how to merge branches

Understand how to resolve conflicts during a merge

What is a branch?

You might have noticed the term branch in status messages:

$ git status

On branch master
nothing to commit, working directory clean

and when we wanted to get back to our most recent version of the repository, we used git checkout master.

Not only can our repository store the changes made to files and directories, it can store multiple sets of these, which we can use and edit and update in parallel. Each of these sets, or parallel instances, is termed a branch and master is Git’s default branch.

A new branch can be created from any commit. Branches can also be merged together.

Why are branches useful?

Suppose we’ve developed some software and now we want to try out some new ideas but we’re not sure yet whether we’ll keep them. We can then create a branch ‘feature1’ and keep our master branch clean. When we’re done developing the feature and we are sure that we want to include it in our program, we can merge the feature branch with the master branch. This keeps all the work-in-progress separate from the master branch, which contains tested, working code.

When we merge our feature branch with master, Git creates a new commit which contains merged files from master and feature1. After the merge we can continue developing. The merged branch is not deleted. We can continue developing (and making commits) in feature1 as well.
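The merge described here can be sketched in a throwaway directory. The example assumes that both branches have gained a commit since they diverged and that they touch different files, so Git can create the merge commit without any conflict; the file names are invented, and the initial branch is explicitly named master to match the lesson.

```shell
# Throwaway repository with a local identity
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # call the initial branch 'master', as in this lesson
git config user.name "Your Name"
git config user.email yourname@yourplace.org

echo "Title" > paper.md
git add paper.md
git commit -q -m "Add title and authors"

# Work on the new idea in its own branch
git checkout -q -b feature1
echo "New idea" > feature.txt
git add feature.txt
git commit -q -m "Try out new idea"

# Meanwhile, master moves on as well
git checkout -q master
echo "A reference" > refs.txt
git add refs.txt
git commit -q -m "Add references file"

# Merge feature1 into master, accepting the default merge message
git merge -q --no-edit feature1

git log --graph --oneline --decorate
git rev-list --count --merges HEAD    # prints: 1 (the new merge commit)
git branch --list feature1            # feature1 still exists after the merge
```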

Branching workflows

One popular model is the Gitflow model:

A master branch, representing a released version of the code
A release branch, representing the beginnings of the next release - a branch where the code is still undergoing testing
Various feature and/or developer-specific branches representing work-in-progress, new features, bug fixes etc

For example:

Feature branches ([image source](https://www.atlassian.com/git/tutorials/comparing-workflows#feature-branch-workflow))

There are different possible workflows when using Git for code development. If you want to learn more about different workflows with Git, have a look at this discussion on the Atlassian website.

Branching in practice

One of our colleagues wants to contribute to the paper but is not quite sure if it will actually make a publication. So it will be safer to create a branch and carry on working on this “experimental” version of the paper in a branch rather than in the master.

$ git checkout -b simulations

Switched to a new branch 'simulations'

We’re going to change the title of the paper and update the author list (adding John Smith). However, before we get started it’s a good practice to check that we’re working on the right branch.

$ git branch			# Double check which branch we are working on

  master
* simulations

The * indicates which branch we’re currently in. Now let’s make the changes to the paper.

$ gedit paper.md		# Change title and add co-author
$ git add paper.md
$ git commit			# "Modify title and add John as co-author"

If we now want to work in our master branch, we can switch back by using:

$ git checkout master

Switched to branch 'master'

Having written some of the paper, we have thought of a better title for the master version of the paper.

$ gedit paper.md		# Rewrite the title
$ git add paper.md
$ git commit			# "Include aircraft in title"

Merging and resolving conflicts

We are now working on two papers: the main one in our master branch and the one which may possibly be collaborative work in our “simulations” branch. Let’s add another section to the paper to write about John’s simulations.

$ git checkout simulations	# Switch branch
$ gedit paper.md		# Add 'simulations' section
$ git add paper.md
$ git commit -m "Add simulations" paper.md

At this point let’s visualise the state of our repo, and we can see the diverged commit history reflecting the recent work on our two branches:

$ git log --graph --all --oneline --decorate

* 89d5c6e (simulations) Add simulations
* 05d393a Change title and add coauthor
| * bdebbe0 (HEAD, master) Include aircraft in title
|/
* 87a65e6 Explain motivation for research
* 6a48241 Cite previous work in introduction
* ed26351 Cite PCASP paper
* 7446b1d Start the introduction
* 4f572d5 Add title and author

After some discussions with John we decided that we will publish together, hence it makes sense to now merge all that was authored together with John in branch “simulations”. We can do that by merging that branch with the master branch. Let’s try doing that:

$ git checkout master		# Switch branch
$ git merge simulations		# Merge simulations into master

Auto-merging paper.md
CONFLICT (content): Merge conflict in paper.md
Automatic merge failed; fix conflicts and then commit the result.

Git cannot complete the merge because there is a conflict: if you recall, after creating the new branch, we changed the title of the paper on both branches. We have to resolve the conflict and then complete the merge. We can get some more detail using git status:

$ git status

On branch master
You have unmerged paths.
  (fix conflicts and run "git commit")

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:	    paper.md

Let’s look inside paper.md:

# Title
<<<<<<< HEAD
Aircraft measurements of biomass burning aerosols over West Africa
=======
Simulations of biomass burning aerosols over West Africa
>>>>>>> simulations

The mark-up shows us the parts of the file causing the conflict and the versions they come from. We now need to manually edit the file to resolve the conflict. This means removing the mark-up and doing one of:

Keep the current version, which is the one marked-up by HEAD i.e. “Aircraft measurements of biomass burning aerosols over West Africa”
Keep the version from the other branch, which is the one marked-up by simulations i.e. “Simulations of biomass burning aerosols over West Africa”
Or manually edit the line to something new which might combine some elements of the two e.g. “Aircraft measurements and simulations of biomass burning aerosols over West Africa”

We edit the file. Then commit our changes:

$ gedit paper.md		# Resolve conflict by editing paper.md
$ git add paper.md		# Let Git know we have resolved the conflict
$ git commit
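The whole conflict-and-resolve cycle can be reproduced in a throwaway repository. The following is a minimal sketch, assuming a temporary directory, a placeholder author identity and made-up starting text for paper.md; it uses `-m` and `--no-edit` instead of an editor so that it runs unattended.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Common starting point for both branches (made-up content)
printf '# Title\nBiomass burning aerosols\n' > paper.md
git add paper.md
git commit -q -m "Add title and author"

# Change the title on the simulations branch...
git checkout -q -b simulations
printf '# Title\nSimulations of biomass burning aerosols over West Africa\n' > paper.md
git commit -q -am "Change title and add coauthor"

# ...and change the same line differently on master
git checkout -q master
printf '# Title\nAircraft measurements of biomass burning aerosols over West Africa\n' > paper.md
git commit -q -am "Include aircraft in title"

# The merge stops with a non-zero exit status because both branches edited the same line
git merge simulations || echo "Merge stopped: resolve the conflict by hand"
cat paper.md                                 # shows the conflict markers

# Resolve by writing the combined title, then stage and commit to finish the merge
printf '# Title\nAircraft measurements and simulations of biomass burning aerosols over West Africa\n' > paper.md
git add paper.md
git commit -q --no-edit                      # keep the default merge message
git --no-pager log --graph --oneline --all
```

In real use you would resolve the conflict in an editor rather than overwriting the file with printf; the overwrite only stands in for that manual step.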

This is where version control proves itself better than Dropbox or Google Drive: it can merge text files line by line and highlight the conflicts between them, so that one person’s work is not silently overwritten by another’s.

We can see the two branches merged if we take another look at the log graph:

$ git log --graph --decorate --all --oneline

*   39cc80d (HEAD, master) Merge branch 'simulations'
|\
| * 89d5c6e (simulations) Add simulations
| * 05d393a Change title and add coauthor
* | bdebbe0 Include aircraft in title
|/
* 87a65e6 Explain motivation for research
* 6a48241 Cite previous work in introduction
* ed26351 Cite PCASP paper
* 7446b1d Start the introduction
* 4f572d5 Add title and author

Looking at our history - revisited

We already looked at “going back in time with Git”. But now we’ll look at it in more detail to see how moving back relates to branches and we will learn how to actually undo things. So far we were moving back in time in one branch by checking out one of the past commits.

But we were then in the “detached HEAD” state.

Add a commit to detached HEAD

Check out one of the previous commits from our repository.

Make some changes and commit them. What happened?

Now try to run git branch. What can you see?
Solution
git checkout HEAD~1  # Check out the commit one before last
gedit paper.md     # Make some edits
git add paper.md   # Stage the changes
git commit		 # Commit the changes
git branch		 # You should see a message like the one below,
			 # indicating your commit does not belong to a branch
* (detached from 57289fb)
  master
You have just made a commit on a detached HEAD – as you can see from the output above, a new temporary branch has been created, which doesn’t have a name.

See this [detached HEAD animation] of the above process.

[detached HEAD animation]: https://learngitbranching.js.org/?NODEMO&command=git%20checkout%20HEAD~;git%20commit

Abandon the commit on a detached HEAD

You decide that you want to abandon that commit. How would you get back to the current version of your project?
Solution
git checkout master
Git will warn you that you are leaving behind changes that would be lost:

The output you see will be slightly different to that below, reflecting your previous commit message and commit ID.
Warning: you are leaving 1 commit behind, not connected to
any of your branches:

eb7c650 Add empty line for branching exercise

If you want to keep them by creating a new branch, this may be a good time
to do so with:

 git branch new_branch_name eb7c650

 Switched to branch 'master'
 Your branch is up-to-date with 'master'.
See this abandon detached HEAD animation.

Save your changes in a new branch

Preparation:

You should be on the master branch after that last exercise. If not, check out master again: git checkout master

Check out one of the previous commits from your repository.

Make some changes, save the file(s), and make a commit on the detached HEAD as you did in the first exercise.

Run git branch to list your local branches, and see that you are on a temporary branch.

This time we want to keep the commit rather than abandon it.

Create a new branch and check it out.

Now run git log and see that your new commit belongs to this new branch.

List your local branches again and see that the temporary branch has gone.

Switch back to (i.e. check out) the master branch.
Solution
git checkout HEAD~1		# Checkout the commit before last
gedit paper.md		# Modify one of your files
git commit -a			# Commit all the modified files
git branch			# List local branches
* (HEAD detached from f908519)
 master
 simulations
You are currently on a temporary, unnamed branch, as indicated by the *.
git branch dh-exercise		# Create a new branch
git checkout dh-exercise	# Switch to the new branch
Switched to branch 'dh-exercise'
git branch			# View local branches
* dh-exercise
 master
 simulations
The commit you made on the detached HEAD now belongs to a named branch (dh-exercise in the example above), rather than a temporary branch.
git checkout master		# Switch back to the 'master' branch
See this new branch animation for the key points in this exercise.
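The steps of this exercise can also be run as a script. This is a minimal sketch, assuming a temporary directory, a placeholder identity and made-up file contents; `git symbolic-ref` is used here only as a way of asking Git whether HEAD currently points at a branch.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Two commits on master (made-up content)
echo "first draft" > paper.md
git add paper.md
git commit -q -m "Start the paper"
echo "second draft" >> paper.md
git commit -q -am "Extend the paper"

# Check out the previous commit: HEAD is now detached
git checkout -q HEAD~1
git symbolic-ref -q HEAD || echo "HEAD is detached"

# Commit on the detached HEAD and remember the commit ID
echo "experimental edit" >> paper.md
git commit -q -am "Try an experiment"
kept=$(git rev-parse HEAD)

# Give the commit a home by creating a branch, then switch to it
git branch dh-exercise
git checkout -q dh-exercise
git symbolic-ref --short HEAD                # HEAD is attached to a branch again

# Back on master, the commit is still reachable through dh-exercise
git checkout -q master
git --no-pager branch --contains "$kept"
```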

Key Points

git branch creates a new branch

Use feature branches for new ideas and fixes, before merging into master

merging does not delete any branches

Lunch break

Undoing changes

Overview

Teaching: 25 min
Exercises: 0 min

Questions

How can I discard unstaged changes?

How do I edit the last commit?

How can I undo a commit?

Objectives

Be able to discard unstaged changes

Be able to amend the most recent commit

Be able to discard all changes since a particular commit

Be able to undo the changes introduced by a commit

There are a number of things which we can amend and change after they have been committed in Git.

Discarding local changes

Maybe we made our change just to see how something looks, or to quickly try something out. But we may be unhappy with our changes. If we haven’t yet done a git add we can just throw the changes away and return our file to the most recent version we committed to the repository by using:

$ gedit paper.md		# Make some small edits to the file
$ git checkout paper.md		# Discard edits we just made

and we can see that our file has reverted to being the most up-to-date one in the repository:

$ git status			# See that we have a clean working directory
$ gedit paper.md		# Inspect file to verify changes have been discarded
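As a minimal sketch of the same steps without an editor (the temporary directory, placeholder identity and file contents are assumptions made for illustration), the script below edits a committed file and then discards the edit. It writes `git checkout -- paper.md`; the `--` tells Git that what follows is a file path rather than a branch name.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Commit a first version of the file (made-up content)
echo "Committed text" > paper.md
git add paper.md
git commit -q -m "Add paper"

# Make an edit that we do not stage
echo "An edit we regret" >> paper.md
git status --short                           # lists paper.md as modified

# Throw the unstaged edit away
git checkout -- paper.md
git status --short                           # nothing left to report
cat paper.md
```

Git 2.23 and later also provide `git restore paper.md` for the same purpose. Either way the discarded edit was never committed, so Git cannot bring it back.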

Amending the most recent commit

Suppose you have just made a commit and realise either that you made it a bit too early, before the files were ready to be committed, or (which is not as uncommon as you might think) that your commit message is not what it should be. You can fix either problem using the command git commit --amend

This opens up the default editor for Git which includes the previous commit message - you can edit it and close the editor. This will simply fix the commit message.

But what if we forgot to include some files in the commit?

Let’s try it on our example. First, let’s modify two files: our paper file and the references file. We will add a methodology section to the paper where we detail the model used for the simulations, and add a reference for this to the references file.

$ gedit paper.md		# Add methodology section, including a reference to model
$ gedit refs.txt		# Add new reference for the model used
$ git status			# Get a status update on file modifications

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   refs.txt
	modified:   paper.md

no changes added to commit (use "git add" and/or "git commit -a")

Let’s then add and commit paper.md but not the references file.

$ git add paper.md		 # Add paper to staging area
$ git commit -m "Describe methodology"

Let’s have a look at our working directory now:

$ git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   refs.txt

no changes added to commit (use "git add" and/or "git commit -a")

Also, run git log -2 to see the message and ID of the latest commit.

Now, we want to fix our commit and add the references file.

$ git add refs.txt	# Add reference file
$ git commit --amend		# Amend most recent commit

This will again bring up the editor and we can amend the commit message if required.

Now when we run git status and then git log we can see that our Working Directory is clean and that both files were added.

$ git status
$ git log -3
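One detail worth seeing for yourself is that amending replaces the commit rather than editing it in place, so the commit ID changes. The sketch below illustrates this under stated assumptions: a temporary directory, a placeholder identity, made-up file contents, and `--no-edit` to keep the existing message without opening an editor.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Starting point: both files are tracked (made-up content)
echo "Introduction" > paper.md
echo "Reference 1" > refs.txt
git add paper.md refs.txt
git commit -q -m "Start paper and references"

# Edit both files but commit only the paper
echo "Methodology" >> paper.md
echo "Reference 2" >> refs.txt
git add paper.md
git commit -q -m "Describe methodology"
before=$(git rev-parse HEAD)

# Add the forgotten file to the same commit
git add refs.txt
git commit -q --amend --no-edit
after=$(git rev-parse HEAD)

echo "ID before amend: $before"
echo "ID after amend:  $after"
git --no-pager show --stat --oneline HEAD    # the commit now lists both files
```

Because the ID changes, amending is best kept to commits that have not yet been shared with anyone else.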

`git revert` (undo changes associated with a commit)

git revert removes the changes applied in a specified commit. However, rather than deleting the commit from history, git works out how to undo those changes introduced by the commit, and appends a new commit with the resulting content.

Let’s try it on our example. Modify the paper, describing the SMPS which is another instrument used to measure particle sizes, and then make a commit.

$ gedit paper.md		# Describe other instrument
$ git add paper.md
$ git commit -m "Describe SMPS"

We now realise that what we’ve just done in our journal article is incorrect because we are not using the data from that instrument. Some of the data got corrupted, and due to problems with the logging computer we are not going to use that data. So it makes sense to abandon the commit completely.

$ git revert HEAD		# Undo changes introduced by most recent commit

When we revert, a new commit is created. The HEAD pointer and the branch pointer are in fact moved forward rather than backwards.
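A minimal sketch of this behaviour, assuming a temporary directory, a placeholder identity and made-up file contents, is shown below. It uses `--no-edit` to accept the default revert message, and the history grows from two commits to three.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Two commits, the second of which we will undo (made-up content)
echo "Instrument: PCASP" > paper.md
git add paper.md
git commit -q -m "Describe PCASP"
echo "Instrument: SMPS" >> paper.md
git commit -q -am "Describe SMPS"

# Undo the most recent commit by adding a new commit that reverses it
git revert --no-edit HEAD

git --no-pager log --oneline                 # three commits: the revert sits on top
cat paper.md                                 # the SMPS line is gone again
```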

We can revert any previous commit. That is, we can “abandon” any of the previous changes. However, depending on the changes we have made since, we may bump into a conflict (which we will cover in more detail later on). For example:

error: could not revert 848361e... Describe SMPS
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'

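Behind the scenes, Git tries to apply the reverse of the commit we’re reverting on top of the commit HEAD is pointing to. If later commits have changed the same lines, Git cannot do this automatically and asks us to resolve the conflict.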

So we have seen that git revert is a non-destructive way to undo a commit. What if we don’t want to keep a record of undoing commits? That would give a neater history. git reset can also be used to undo commits, but it does so by deleting history.

`git reset --hard` (restore a previous state by deleting history)

git reset has several uses, and is most often used to unstage files from the staging area i.e. git reset or git reset <file>.

We are going to use a variant git reset --hard <commit> to reset things to how they were at <commit>. This is a permanent undo which deletes all changes more recent than <commit> from your history. There is clearly potential here to lose work, so use this command with care.

Let’s try that on our paper, using the same example as before. Now we have two commits which we want to abandon: the commit outlining the unreliable instrumentation, and the subsequent revert commit. We can achieve this by resetting to the last commit we want to keep.

We can do that by running:

$ git reset --hard HEAD~2	# Move tip of branch to two commits before HEAD

HEAD is now at fbdc44b Add methodology section and update references file

This moves the tip of the branch back to the specified commit. If we look in-depth, this command moves back two pointers: HEAD and the pointer to the tip of the branch we currently are working on (master). (HEAD~ = the commit right before HEAD; HEAD~2 = two commits before HEAD)

The final effect is what we need: we abandoned the commits and we are now back to where we were before making the commit about the data we are not using.
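The effect on the history can be checked by counting commits before and after the reset. This is a sketch only, assuming a temporary directory, a placeholder identity and made-up file contents that mimic the situation described above (one good commit, one unwanted commit and its revert).

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# One commit to keep, one unwanted commit, and the revert of it
echo "Instrument: PCASP" > paper.md
git add paper.md
git commit -q -m "Describe PCASP"
echo "Instrument: SMPS" >> paper.md
git commit -q -am "Describe SMPS"
git revert --no-edit HEAD

count_before=$(git rev-list --count HEAD)    # commits reachable before the reset

# Move master (and HEAD) back two commits and make the files match
git reset -q --hard HEAD~2

count_after=$(git rev-list --count HEAD)
echo "Commits before reset: $count_before, after reset: $count_after"
git --no-pager log --oneline
```

Note that `--hard` also throws away any uncommitted edits in the working directory, which is one more reason to run git status before using it.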

Click for an animation of the revert and reset operations we just used.

This article discusses git reset in more depth, showing the differences between the three options:

--soft
--mixed
--hard

Top tip: do not use git reset with remote branches

There is one important thing to remember about the reset command: it should only be used with branches that have not been shared yet (that is, they haven’t been pushed into a remote repository that others are using). Resetting changes the history without leaving a trace. This is bad practice when using shared remote repositories and can lead to a horrible mess.

Reverting records the fact of “abandoning the commit” in the history. When we revert in a branch that is shared with others and then push that branch into the remote repository, it is as if we “came clean” about what we were doing. Everyone who pulls the branch in which we reverted changes will see it. With git reset we “keep it secret” that we have undone some changes.

As such, if we want to abandon changes in branches that are shared with others, we should use the revert command.

Reset vs revert

See this Atlassian online tutorial for further reading about the differences between git revert and git reset.

How to undo almost anything with Git

See this blog post for more example scenarios and how to recover from them.

Key Points

git checkout <file> discards unstaged changes

git commit --amend allows you to edit the last commit

git revert undoes a commit, preserving history

git reset undoes a commit by deleting history

Working from multiple locations with a remote repository

Overview

Teaching: 25 min
Exercises: 0 min

Questions

What is a remote repository?

How can I use GitHub to work from multiple locations?

Objectives

Understand how to set up a remote repository

Understand how to push local changes to a remote repository

Understand how to clone a remote repository

We’re going to set up a remote repository that we can use from multiple locations. The remote repository can also be shared with colleagues, if we want to.

GitHub

GitHub is a company which provides remote repositories for Git and a range of functionalities supporting their use. GitHub allows users to set up private and public Git repositories for their source code. It provides tools for browsing, collaborating on and documenting code. GitHub, like other services such as Launchpad, Bitbucket and SourceForge, provides a wealth of resources to support projects, including:

Time histories of changes to repositories
Commit-triggered e-mails
Browsing code from within a web browser, with syntax highlighting
Software release management
Issue (ticket) and bug tracking
Download
Varying permissions for various groups of users
Other service hooks e.g. to Twitter.

Note that public repositories on GitHub can be viewed by anyone. If you don’t want to share (in the most liberal sense) your stuff with the world and you want to use GitHub, you will need to use private repositories. The limits and pricing for private repositories have changed over time, so do check GitHub’s current terms and conditions.

GitHub for research

GitHub isn’t the only remote repositories provider. It is however very popular, in particular within the Open Source communities. The reason why we teach GitHub in this tutorial is mainly due to popular demand.

Also, GitHub has started working on functionality which is particularly useful for researchers, such as making code citable.

Get an account

Let’s get back to our tutorial. We will first need a GitHub account.

Set up SSH keys

SSH is an encrypted network protocol which we will use to securely access our remote repository. In order to use it, we need to set up a pair of SSH keys, which are used together to validate access. There’s a private key, and a public key - GitHub needs to know the public key, but the private key stays only on your computer. A useful analogy is to think of the public key as a padlock, and the private key as the only key to the padlock.

Create ssh keys

Let’s first check whether we already have ssh keys set up:

$ ls ~/.ssh

If you already have ssh keys set up, your output will look something like this:

id_ed25519  id_ed25519.pub

and you can jump to the final step.

If you still need to set up ssh keys, you’ll get a message like this:

ls: cannot access '/home/yourusername/.ssh': No such file or directory

To set up the key pair, we use the following command

$ ssh-keygen -t ed25519 -C "your_email@example.com"

You might get an error from this if your system doesn’t support the ed25519 algorithm, in which case you can try $ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/you/.ssh/id_ed25519):

Accept the default option using Enter.

Created directory  '/home/you/.ssh'.
Enter passphrase (empty for no passphrase):

Enter a passphrase (you’ll be prompted to enter it twice)

Your identification has been saved in /home/you/.ssh/id_ed25519
Your public key has been saved in /home/you/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:SMSPIStNyA00KPxuYu94KpZgRAYjgt9g4BA4kFy3g1o your_email@example.com
The key's randomart image is:
+--[ED25519 256]--+
|^B== o.          |
|%*=.*.+          |
|+=.E =.+         |
| .=.+.o..        |
|....  . S        |
|.+ o             |
|+ =              |
|.o.o             |
|oo+.             |
+----[SHA256]-----+

Now that we have generated the SSH keys, we will find the SSH files when we check.

$ ls ~/.ssh

id_ed25519  id_ed25519.pub

We can view the public key using

$ cat ~/.ssh/id_ed25519.pub

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDmRA3d51X0uu9wXek559gfn6UFNF69yZjChyBIU2qKI your_email@example.com

Now you should copy the output from this command ready for the final step.

Add public ssh key to GitHub

The final step is to add the public key to our GitHub accounts.

On GitHub, click on your profile icon in the top right corner
Click “Settings,” then on the settings page
Click “SSH and GPG keys”
Click the “New SSH key” button on the right side.
Add a title e.g. “my_work_laptop”, paste your public SSH key into the field, and click the “Add SSH key” button to complete the setup.

Set the default GitHub branch name to ‘master’

As we saw in episode 2, the default branch name in a git repo is master.

In 2020 and 2021, GitHub and many other remote repo providers changed their settings so that new repositories use main instead of master. As ever there are arguments for and against this change. We can however choose the default branch name in our GitHub settings, so let’s set it to master to be consistent with the git software itself.

On GitHub, click on your profile photo at the top right of the page. Then go to Settings -> Repositories -> Repository default branch.

Change ‘main’ to ‘master’ and click ‘update’.

Create a new repository

Now we can create a repository on GitHub:

Log in to GitHub
Click on the Create icon on the top right
Enter Repository name: “paper”
For the purpose of this exercise we’ll create a public repository
Make sure that Initialize this repository with a README is unselected
Click Create Repository

You’ll get a page with new information about your repository. We already have our local repository and we will be pushing it to GitHub using SSH, so this is the option we will use:

$ git remote add origin git@github.com:<USERNAME>/paper.git
$ git push -u origin master

The first line sets up an alias origin, to correspond to the URL of our new repository on GitHub.

Push locally tracked files to a remote repository

Now copy and paste the second line,

$ git push -u origin master

Counting objects: 32, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (28/28), done.
Writing objects: 100% (32/32), 3.29 KiB | 0 bytes/s, done.
Total 32 (delta 7), reused 0 (delta 0)
To https://github.com/gcapes/paper.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.

This pushes our master branch to the remote repository, named via the alias origin and creates a new master branch in the remote repository.

Now, on GitHub, we should see our code and if we click the Commits tab we should see our complete history of commits.

Our local repository is now available on GitHub. So, anywhere we can access GitHub, we can access our repository.

Push other local branches to a remote repository

Let’s push each of our local branches into our remote repository:

$ git push origin branch_name

The branch should now be created in our GitHub repository.

To list all branches (local and remote):

$ git branch -a

Deleting branches (for information only)

Don’t do this now. This is just for information. To delete branches, use the following syntax:
$ git branch -d <branch_name>			# For local branches
$ git push origin --delete <branch_name>	# For remote branches

Cloning a remote repository

Now that we have a copy of the repo on GitHub, we can download or git clone a fresh copy to work on from another computer.

So let’s pretend that the repo we’ve been working on so far is on a PC in the office, and you want to do some work on your laptop at home in the evening.

Before we clone the repo, we’ll navigate up one directory so that we’re not already in a git repo.

$ cd ..

Then to clone the repo into a new directory called laptop_paper

$ git clone https://github.com/<USERNAME>/paper.git laptop_paper

Cloning into 'laptop_paper'...
remote: Counting objects: 32, done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 32 (delta 7), reused 32 (delta 7), pack-reused 0
Unpacking objects: 100% (32/32), done.
Checking connectivity... done.

Cloning creates an exact copy of the repository. By default it creates a directory with the same name as the repository. However, we already have a paper directory, so we have specified that we want to clone into a new directory, laptop_paper.

Now, if we cd into laptop_paper we can see that we have our repository,

$ cd laptop_paper
$ git log

and we can see our Git configuration files too:

$ ls -A

In order to see the other branches locally, we can check them out as before:

$ git branch -r					# Show remote branches
$ git checkout simulations			# Check out the simulations branch

Push changes to a remote repository

We can use our cloned repository just as if it was a local repository so let’s add a results section and commit the changes.

$ git checkout master				# We'll continue working on the master branch
$ gedit paper.md				# Add results section
$ git add paper.md				# Stage changes
$ git commit

Having done that, how do we send our changes back to the remote repository? We can do this by pushing our changes,

$ git push origin master

If we now check our GitHub page we should be able to see our new changes under the Commits tab.

To see all remote repositories (we can have multiple!) type:

$ git remote -v
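If you want to rehearse the push and clone steps without touching GitHub, a bare repository on your own disk can stand in for the remote. The sketch below assumes exactly that: remote.git is a local stand-in rather than a GitHub URL, and the temporary directory, placeholder identity and file contents are made up for illustration.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"

# A bare repository plays the role of the remote on GitHub
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# The "office PC" repository: one commit, then connect and push
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"
echo "# Title" > paper.md
git add paper.md
git commit -q -m "Add title and author"
git remote add origin "$workdir/remote.git"  # origin is an alias for the remote
git push -q -u origin master                 # -u makes master track origin/master

# The "laptop" copy: clone, commit and push back
cd "$workdir"
git clone -q remote.git laptop_paper
cd laptop_paper
git config user.name "Example Author"
git config user.email "author@example.com"
echo "Results" >> paper.md
git commit -q -am "Discuss results"
git push -q origin master

git remote -v                                # origin was set up by the clone
```

With GitHub the only difference is the address given to git remote add and git clone.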

Key Points

Git is the version control system; GitHub is a remote repository provider.

git clone to make a local copy of a remote repository

git push to send local changes to remote repository

Collaborating with a remote repository

Overview

Teaching: 25 min
Exercises: 15 min

Questions

How do I update my local repository with changes from the remote?

How can I collaborate using Git?

Objectives

Understand how to pull changes from a remote repository

Understand how to resolve merge conflicts

Pulling changes from a remote repository

Having a remote repository means we can share it and collaborate with others (or even just continue to work alone but from multiple locations). We’ve seen how to clone the whole repo, so next we’ll look at how to update our local repo with just the latest changes on the remote.

We were in the laptop_paper directory at the end of the last episode, having pushed one commit to the remote. Let’s now change directory to the other repository paper, and git pull the commit from the remote.

$ cd ../paper
$ git pull origin master

We can now view the contents of paper.md and check the log to confirm we have the latest commit from the remote:

$ git log -2

Still in the paper directory, let’s add a figures section to paper.md, commit the file and push these changes to GitHub:

$ gedit paper.md		# Add figures section
$ git add paper.md
$ git commit -m "Add figures"
$ git push

Now let’s change directory to our other repository and fetch the commits from our remote repository,

$ cd ../laptop_paper		# Switch to the other directory
$ git fetch

git fetch doesn’t change any of the local branches, it just gets information about what commits are on the remote branches.

We can visualise the remote branches in the same way as we did for local branches, so let’s draw a network graph before going any further:

$ git log --graph --all --decorate --oneline

* 7c239c3 (origin/master, origin/HEAD) Add figures
* 0cc2a2d (HEAD -> master) Discuss results
* 3011ee0 Describe methodology
*   6420699 Merge branch 'simulations'
|\
| * 7138785 (origin/simulations) Add simulations
| * e695fa8 Change title and add coauthor
* | e950911 Include aircraft in title
|/
* 0b28b0a Explain motivation for research
* 7cacba8 Cite previous work in introduction
* 56781f4 Cite PCASP paper
* 5033467 Start the introduction
* e08262e Add title and author

As expected, we see that the origin/master branch is ahead of our local master branch by one commit — note that the history hasn’t diverged, rather our local branch is missing the most recent commit on origin/master.

We can now see what the differences are by doing,

$ git diff origin/master

which compares our local master branch with origin/master. This is the remote-tracking branch which records the state of the master branch on origin, and origin is the alias for the remote repository we cloned from (the one on GitHub).

We can then merge these changes into our current repository, but given the history hasn’t diverged, we don’t get a merge commit — instead we get a fast-forward merge.

$ git merge origin/master

Updating 0cc2a2d..7c239c3
Fast-forward
 paper.md | 4 ++++
 1 file changed, 4 insertions(+)

If we look at the network graph again, all that has changed is that master now points to the same commit as origin/master.

$ git log --graph --all --decorate --oneline -4

* 7c239c3 (HEAD -> master, origin/master, origin/HEAD) Add figures
* 0cc2a2d Discuss results
* 3011ee0 Describe methodology
*   6420699 Merge branch 'simulations'

We can inspect the file to confirm that we have our changes.

$ cat paper.md
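The fetch-then-merge sequence can be tried out with a local bare repository standing in for GitHub. This is a sketch under stated assumptions: remote.git, the temporary directory, the placeholder identity and the file contents are all made up, and `--ff-only` is added to the merge so that Git would refuse rather than create a merge commit if the histories had diverged.

```shell
# Work in a temporary directory so no real repository is touched
workdir=$(mktemp -d) && cd "$workdir"

# A bare repository plays the role of the remote on GitHub
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# First copy: create the history and push it
git init -q paper && cd paper
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"
echo "Results" > paper.md
git add paper.md
git commit -q -m "Discuss results"
git remote add origin "$workdir/remote.git"
git push -q -u origin master

# Second copy: cloned before the next commit exists
git clone -q "$workdir/remote.git" "$workdir/laptop_paper"

# Back in the first copy: add a commit and push it
echo "Figures" >> paper.md
git commit -q -am "Add figures"
git push -q origin master

# In the second copy: fetch updates origin/master but leaves master alone
cd "$workdir/laptop_paper"
git fetch -q
behind=$(git rev-list --count master..origin/master)
echo "master is behind origin/master by $behind commit(s)"

# Fast-forward master to origin/master
git merge -q --ff-only origin/master
git --no-pager log --oneline --decorate -2
```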

So we have now used two slightly different methods to get the latest changes from the remote repo. You may already have guessed that git pull is a shorthand for git fetch followed by git merge.

Fetch vs pull

If git pull is a shortcut for git fetch followed by git merge, then why would you ever want to do these steps separately?

Well, depending on what the commits on the remote branch contain, you might want to abandon your local commits before merging (e.g. your local commits duplicate the changes on the remote), rebase your local branch to avoid a merge commit, or something else.

Fetching first lets you inspect the changes before deciding what you want to do with them.

Let’s write the conclusions:

$ gedit paper.md		# Write Conclusions
$ git add paper.md
$ git commit -m "Write Conclusions" paper.md
$ git push origin master
$ cd ../paper			# Switch back to the paper directory
$ git pull origin master	# Get changes from remote repository

This is the same scenario as before, so we get another fast-forward merge.

We can check that we have our changes:

$ cat paper.md
$ git log

Conflicts and how to resolve them

Let’s continue to pretend that our two local repositories are hosted on two different machines. You should still be in the original paper folder. Add an affiliation for each author. Then push these changes to our remote repository:

$ gedit paper.md		# Add author affiliations
$ git add paper.md
$ git commit -m "Add author affiliations"
$ git push origin master

Now let us suppose, at a later date, we use our other repository (on the laptop) and we want to change the order of the authors.

The remote branch origin/master is now ahead of our local master branch on the laptop, because we haven’t yet updated our local branch using git pull.

$ cd ../laptop_paper		# Switch directory to other copy of our repository
$ gedit paper.md		# Change order of the authors
$ git add paper.md
$ git commit -m "Change the first author" paper.md
$ git push origin master

To https://github.com/<USERNAME>/paper.git
 ! [rejected]	     master -> master (fetch first)
error: failed to push some refs to 'https://github.com/<USERNAME>/paper.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Our push fails, as we’ve not yet pulled down our changes from our remote repository. Before pushing we should always pull, so let’s do that…

$ git pull origin master

and we get:

Auto-merging paper.md
CONFLICT (content): Merge conflict in paper.md
Automatic merge failed; fix conflicts and then commit the result.

As we saw earlier, with the fetch and merge, git pull pulls down changes from the repository and tries to merge them. It does this on a file-by-file basis, merging files line by line. We get a conflict if a file has changes that affect the same lines and those changes can’t be seamlessly merged. We had this situation before in the branching episode when we merged a feature branch into master. If we look at the status,

$ git status

we can see that our file is listed as Unmerged and if we look at paper.md, we see something like:

<<<<<<< HEAD
Author
G Capes, J Smith
=======
author
J Smith, G Capes
>>>>>>> 1b55fe7f23a6411f99bf573bfb287937ecb647fc

The mark-up shows us the parts of the file causing the conflict and the versions they come from. We now need to manually edit the file to resolve the conflict, just as we did when we dealt with the conflict while merging the branches.

We edit the file. Then commit our changes. Now, if we push …

$ gedit paper.md		# Edit file to resolve merge conflict
$ git add paper.md		# Stage the file
$ git commit			# Commit to mark the conflict as resolved
$ git push origin master

… all goes well. If we now go to GitHub and click on the “Overview” tab we can see where our repository diverged and came together again.

This is where version control proves itself better than Dropbox or Google Drive: it can merge text files line by line and highlight the conflicts between them, so no work is ever lost.
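To practise this without touching a real project, the whole scenario can be reproduced in a throwaway sandbox. The following is a minimal sketch under stated assumptions: the bare repository remote.git stands in for GitHub, the author names and commit messages are made up, and --no-rebase is passed so that git pull performs a merge explicitly (recent versions of Git may otherwise ask how to reconcile divergent branches).

```shell
# Sandbox: a bare "remote" plus two clones (all names here are hypothetical)
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q paper
cd paper
git symbolic-ref HEAD refs/heads/master      # make sure the branch is called master
printf 'Author\nG Capes, J Smith\n' > paper.md
git add paper.md
git commit -q -m "Add author list"
cd ..
git clone -q --bare paper remote.git         # stands in for the GitHub repository
git -C paper remote add origin "$sandbox/remote.git"
git clone -q remote.git laptop_paper

# First copy: change the author line and push
cd paper
printf 'Author\nJ Smith, G Capes\n' > paper.md
git commit -q -am "Change first author"
git push -q origin master

# Laptop copy: change the same line differently, without pulling first
cd ../laptop_paper
printf 'Author\nG Capes, J Smith, A N Other\n' > paper.md
git commit -q -am "Add a third author"

# Pull with an explicit merge; Git stops because both sides changed the same line
git pull --no-rebase origin master || echo "Merge stopped with conflicts"

# List only the files that still contain conflicts
conflicted=$(git diff --name-only --diff-filter=U)
echo "Conflicted: $conflicted"

# Resolve by writing the agreed text, then stage, commit and push
printf 'Author\nJ Smith, G Capes, A N Other\n' > paper.md
git add paper.md
git commit -q -m "Merge remote changes to author list"
git push -q origin master
```

In a real conflict you would open the file in your editor and remove the conflict markers by hand rather than overwrite the file with printf.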

We’ll finish by pulling these changes into the other copy of the repo, so both copies are up to date:

$ cd ../paper			# Switch to 'paper' directory
$ git pull origin master	# Merge remote branch into local

Collaborating on a remote repository

In this exercise you should work with a partner or a group of three. One of you should give access to your remote repository on GitHub to the others (by selecting Settings -> Manage access -> Invite a collaborator). The invited person should then check their email to accept the invitation.

Now those of you who have been added as collaborators should clone the first person’s repository onto your own machines. (Make sure that you don’t clone into a directory that is already a repository!)

Each of you should now make some changes to the files in the repository e.g. fix a typo, add a file containing supplementary material. Commit the changes and then push them back to the remote repository. Remember to pull changes before you push.

Creating branches and sharing them in the remote repository

Working with the same remote repository, each of you should create a new branch locally and push it back to the remote repo.

Each person should use a different name for their local branch. The following commands assume your new branch is called my_branch, and your partner’s branch is called their_branch — you should substitute the name of your new branch and your partner’s new branch.
$ git checkout -b my_branch		# Create and check out a new branch.
					# Substitute your local branch name for 'my_branch'.
Now create/edit a file (e.g. fix a typo, add supplementary material etc), and then commit your changes.
$ git push origin my_branch		# Push your new branch to remote repo.
Everyone should then check out local copies of the branches created by the others (so eventually everybody should have the same number of branches as the remote repository).

To fetch new branches from the remote repository (into your local .git database):
$ git fetch origin
Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 2 (delta 0)
Unpacking objects: 100% (3/3), done.
From https://github.com/gcapes/paper
   9e1705a..640210a  master       -> origin/master
 * [new branch]      their_branch -> origin/their_branch
Your local repository should now contain all the branches from the remote repository, but the fetch command doesn’t actually update your local branches.

The next step is to check out a new branch locally to track the new remote branch.
$ git checkout their_branch
Branch their_branch set up to track remote branch their_branch from origin.
Switched to a new branch 'their_branch'
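
The fetch and checkout steps can be tried out locally as well. This is a rough sketch, assuming a bare repository (remote.git) in place of GitHub and two hypothetical working copies called mine and theirs. It shows that git fetch only creates the remote-tracking branch origin/their_branch, and that checking out their_branch then creates a local branch which tracks it.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Our copy, plus a bare repository standing in for GitHub
git init -q mine
cd mine
git symbolic-ref HEAD refs/heads/master
echo "Draft" > paper.md
git add paper.md
git commit -q -m "Start paper"
cd ..
git clone -q --bare mine remote.git
git clone -q remote.git theirs

# Our partner creates a branch and pushes it to the remote
cd theirs
git checkout -q -b their_branch
echo "Supplementary material" > supplement.md
git add supplement.md
git commit -q -m "Add supplementary material"
git push -q origin their_branch

# Back in our copy: fetch, then list the remote-tracking branches
cd ../mine
git remote add origin "$sandbox/remote.git"
git fetch -q origin
git branch -r

# Create a local branch that tracks origin/their_branch
git checkout -q their_branch
upstream=$(git rev-parse --abbrev-ref '@{u}')
echo "their_branch tracks: $upstream"
```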

Undoing changes using revert

Once you have the branches which others created, try to undo one of the commits.

Each one of you should try to revert a commit in a different branch to your partner(s).

Push the branch back to the remote repository. The others should pull that branch to get the changes you made.

What is the end result? What happens when you pull the branch that your colleagues changed using git revert?

Solution

The revert shows up in everyone’s copy. You should always use revert to undo changes which have been shared with others.
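
The reason revert is safe to share is that it adds a new commit instead of removing an old one. A small sketch in a throwaway repository (the file contents and commit messages are invented for illustration):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
echo "Results" > paper.md
git add paper.md
git commit -q -m "Discuss results"
echo "A doubtful claim" >> paper.md
git commit -q -am "Add doubtful claim"
before=$(git rev-list --count HEAD)

# Undo the last commit by adding a new commit that reverses its changes
git revert --no-edit HEAD
after=$(git rev-list --count HEAD)

# The history grows by one commit; nothing already shared is rewritten
git log --oneline
echo "Commits before: $before, after: $after"
```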

Key Points

git pull to integrate remote changes into local copy of repository

Rebasing

Overview

Teaching: 25 min
Exercises: 0 min

Questions

What is rebasing?

Objectives

Understand what is meant by rebasing

Understand the difference between merging and rebasing

When (and when not) to rebase

We were in the paper directory at the end of the last episode, which is where this episode continues.

Let’s review the recent history of our project, noting particularly the commit message which results when origin/master and master diverge, and origin/master is merged back into master.

$ git log --graph --all --oneline --decorate -6

*   365748e (HEAD -> master, origin/master, origin/HEAD) Merge branch 'master' of github.com:gcapes/paper
|\
| * ff18da4 Add author affiliations
* | 8f44540 Change first author
|/
* 8494909 Write conclusions
* e90a501 Add figures
* 3011ee0 Discuss results

Normally a merge commit indicates that a feature branch has been completed, a bug has been fixed, or marks a release version of our project. Our most recent merge commit doesn’t mark any real milestone in the history of the project — all it tells us is that we didn’t pull before we tried to push. Merge commits like this don’t add any real value¹, and can quickly clutter the history of a project.

If only there were a way to avoid them, e.g. by starting with the tip of the remote branch and reapplying our local commits from this new starting point. You could also describe this as moving the local commits onto a new base commit i.e. rebasing.

What is it?

Rebasing is the process of moving a whole branch to a new base commit. Git takes your changes, and “replays” them onto the new base commit. This creates a brand new commit for each commit in the original branch. As such, your history is rewritten when you rebase.

It’s like saying “add my changes to what has already been done”.

Visual illustration of rebasing - image taken from
[https://www.atlassian.com/git/tutorials/rewriting-history/git-rebase](https://www.atlassian.com/git/tutorials/rewriting-history/git-rebase)

How’s that different to merging?

Imagine you create a new feature branch to work in, and meanwhile there have been commits added to the master branch, as shown below.

You’ve finished working on the feature, and you want to incorporate your changes from the feature branch into the master branch. You could merge directly or rebase then merge. We have already encountered merging, and it looks like this:

The main reason you might want to rebase is to maintain a linear project history. In the example above, if you merge directly (recall that there are new commits on both the master branch and feature branch), you have a 3-way merge (common ancestor, HEAD and MERGE_HEAD) and a merge commit results. Note that you get a merge commit whether or not there are any merge conflicts.

If you rebase, your commits from the feature branch are replayed onto master, creating brand new commits in the process. If there are any merge conflicts, you are prompted to resolve these.

After rebasing, you can then perform a fast-forward merge into master i.e. without an extra merge commit at the end, so you have a nice clean linear history.
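
The rebase-then-merge route might look like this in practice. It is only a sketch, using a sandbox repository with made-up file names; --ff-only is added so that Git refuses to proceed if a merge commit would be needed.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo "Introduction" > intro.md
git add intro.md
git commit -q -m "Write introduction"

# Work on a feature branch ...
git checkout -q -b feature
echo "Methods" > methods.md
git add methods.md
git commit -q -m "Write methods"

# ... while master moves on
git checkout -q master
echo "Results" > results.md
git add results.md
git commit -q -m "Write results"

# Replay the feature commits on top of the current tip of master
git checkout -q feature
git rebase -q master

# master can now simply be fast-forwarded to feature
git checkout -q master
git merge -q --ff-only feature
git log --graph --oneline --decorate
merges=$(git rev-list --merges --count HEAD)
echo "Merge commits in history: $merges"
```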

Why would I consider rebasing?

Rebase and merge solve the same problem: integrating commits from one branch into another. Which method you use is largely personal preference.

Some reasons to consider rebasing:

To give a linear project history, which is easier to follow
- This makes using git log, and git bisect easier
To integrate upstream changes into your local repository, without creating any merge commits
To keep a feature branch up to date with master, without polluting your feature branch with extraneous merge commits
To make pull requests easier to manage (because you’ve already resolved any merge conflicts while rebasing)
To tidy up a feature branch before merging into master (requires interactive rebase)

Interactive rebasing

git rebase -i will open an interactive rebasing session. This provides an opportunity to edit, delete, combine, and reorder individual commits as they are moved onto the new base commit. This can be useful for cleaning up history before sharing it with others.
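
An interactive session needs an editor, so it is awkward to show on the page. As a rough, non-interactive approximation, the sketch below folds a typo fix into the commit it belongs to. It relies on two assumptions worth naming: git commit --fixup marks the fix as belonging to an earlier commit, and setting GIT_SEQUENCE_EDITOR=true makes Git accept the to-do list that --autosquash prepares instead of opening it for editing. The file contents are invented.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
echo "Title" > paper.md
git add paper.md
git commit -q -m "Start paper"
echo "Methds" >> paper.md                 # note the typo
git commit -q -am "Write methods"

# Fix the typo and record the fix as a 'fixup!' commit aimed at the previous commit
printf 'Title\nMethods\n' > paper.md
git commit -q -a --fixup=HEAD

# In a real session an editor would open here with the list of commits.
# GIT_SEQUENCE_EDITOR=true accepts the list as prepared by --autosquash.
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash HEAD~2

# The fix has been combined with "Write methods": two commits remain
git log --oneline
```

When you run git rebase -i by hand, the same effect is achieved by changing pick to fixup (or squash) next to the commit you want to fold in.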

A worked example using `git rebase <base>`

We’ll repeat the scenario from the last episode where the local and remote branches diverge, but instead of merging the remote branch origin/master into master, we’ll rebase master onto origin/master.

We’ll write some acknowledgements, then commit and push.

$ gedit paper.md				# Write acknowledgements
$ git add paper.md
$ git commit -m "Add acknowledgements"
$ git push origin master			# Push master branch to remote

We’ll now switch machine to our laptop, and write the abstract:

$ cd ../laptop_paper				# Pretend we're on the laptop
$ gedit paper.md				# Add abstract section
$ git add paper.md
$ git commit					# "Write abstract"

At this point we can view a graph of project history, and see where the master branch diverges from origin/master:

$ git fetch					# Retrieve information about remote branches
$ git log --graph --all --oneline --decorate	# View project history before rebasing

* 21cfe5f (HEAD -> master) Write abstract
| * 13aa7e3 (origin/master, origin/HEAD) Add acknowledgements
|/
*   365748e Merge branch 'master' of github.com:gcapes/paper
|\
| * ff18da4 Add author affiliations
* | 8f44540 Change first author
|/
* 8494909 Write conclusions

As before, if we try to push our local branch, it will fail: Git will suggest that we pull in order to merge the remote commit into our local branch, before pushing again. We did that in the last episode, which resulted in a ‘forgot-to-pull’ merge commit. This time we will replay our local branch onto the remote branch.

$ git rebase origin/master			# Rebase current branch onto origin/master

Note that origin/master only reflects the latest remote commits because we just ran git fetch. Typically, you would use git pull --rebase instead, which combines the fetch and rebase steps.
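
For comparison, here is a sketch of the same scenario using git pull --rebase in a sandbox. The assumptions: a bare repository (remote.git) plays the part of GitHub, and the two copies edit different files so that no conflicts arise.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q paper
cd paper
git symbolic-ref HEAD refs/heads/master
echo "Conclusions" > paper.md
git add paper.md
git commit -q -m "Write conclusions"
cd ..
git clone -q --bare paper remote.git         # stands in for the GitHub repository
git -C paper remote add origin "$sandbox/remote.git"
git clone -q remote.git laptop_paper

# First copy: commit and push
cd paper
echo "Acknowledgements" > acknowledgements.md
git add acknowledgements.md
git commit -q -m "Add acknowledgements"
git push -q origin master

# Laptop copy: commit without pulling, so master and origin/master diverge
cd ../laptop_paper
echo "Abstract" > abstract.md
git add abstract.md
git commit -q -m "Write abstract"

# Fetch and rebase in one step, then push
git pull -q --rebase origin master
git push -q origin master
git log --graph --oneline --decorate
```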

Merge conflicts during a rebase

Depending what changes we have made, there may be conflicts we have to fix in order to rebase. If this is the case, Git will let us know, and give some instructions on how to proceed. The process for fixing conflicts is the same as before:
$ gedit file					# Manually fix conflicts in affected file(s)
$ git add file					# Mark file(s) as resolved
$ git rebase --continue				# Continue to rebase
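
To see these steps in action, a conflict can be provoked deliberately in a sandbox. This is a sketch with invented file contents. GIT_EDITOR=true is an assumption made so that the script does not stop to open an editor for the commit message when the rebase continues.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo "Draft title" > paper.md
git add paper.md
git commit -q -m "Start paper"

# Change the same line on two branches
git checkout -q -b feature
echo "Title from feature" > paper.md
git commit -q -am "Reword title on feature"
git checkout -q master
echo "Title from master" > paper.md
git commit -q -am "Reword title on master"

# The rebase pauses when it reaches the conflicting commit
git checkout -q feature
git rebase master || echo "Rebase paused on a conflict"
git status --short                   # 'UU paper.md' marks the conflicted file

# Fix the file, mark it as resolved, and carry on
echo "Agreed title" > paper.md
git add paper.md
GIT_EDITOR=true git rebase --continue
git log --graph --oneline
```

If you decide part-way through that you would rather not rebase after all, git rebase --abort returns the branch to the state it was in before the rebase started.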

Let’s now visualise our project history again, having rebased master onto origin/master, and observe that we now have a linear project history. Rebasing has created a new commit (with a new commit ID) and put it on top of the commit pointed at by origin/master — thus avoiding that forgot-to-pull merge commit!

$ git log --graph --all --oneline --decorate	# View project history after rebasing

* 6105e61 (HEAD -> master) Write abstract
* 13aa7e3 (origin/master, origin/HEAD) Add acknowledgements
*   365748e Merge branch 'master' of github.com:gcapes/paper
|\
| * ff18da4 Add author affiliations
* | 8f44540 Change first author
|/
* 8494909 Write conclusions

Having integrated the remote changes into our local branch, we can now push our local branch back to ‘origin’.

$ git push origin master

This online tutorial gives a good illustration of what happens during rebasing.

Warning: the perils of rebasing

The main rule is: do not rebase branches shared with other contributors. Rebasing changes history and as with practically any Git command which changes history, it should be used with care.

The branches that are pushed to remote repositories should always be merged. For your local branches that you never share, you may use rebasing. Rebasing is convenient if you want to keep a clean history. It also helps to avoid conflicts in the long run. But again, it is considered a better practice to use merge and deal with conflicts rather than mess up shared branches using rebase.

¹ This statement contains elements of opinion.

Key Points

rebase applies your changes on top of a new base (parent) commit

rebasing rewrites history

Pull Requests

Overview

Teaching: 5 min
Exercises: 10 min

Questions

How can I contribute to a repository to which I don’t have write access?

Objectives

Understand what it means to fork a repository

Be able to fork a repository on GitHub

Understand how to submit a pull request

Pull Requests are a great solution for contributing to repositories to which you don’t have write access. Adding other people as collaborators to a remote repository is a good idea but sometimes (or even most of the time) you want to make sure that their contributions will provide more benefits than the potential mistakes they may introduce.

In large projects, primarily Open Source ones, in which the community of contributors can be very big, keeping the source code safe while still allowing people to contribute, without making them “pass” tests of their skills and trustworthiness, may be one of the keys to success.

Leveraging the power of Git, GitHub provides a functionality called Pull Requests. Essentially it’s “requesting the owner of the repository to pull in your contributions”. The owner may or may not accept them. But for you as a contributor, it is really easy to make the contribution.

The process

Find a repository on GitHub that belongs to someone else
Fork it (git clone it on GitHub’s servers into your GitHub account)
git clone it to your PC/laptop
Create a new branch
Make changes, and push them to your repository on GitHub
Request that the owner of the repository you forked pulls in your changes

Conceptual illustration of a pull request - image adapted from
[here](http://acrl.ala.org/techconnect/post/coding-collaboration-on-github)
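
The Git side of the process listed above can be rehearsed locally. In this sketch, two bare repositories stand in for GitHub: upstream.git plays the owner’s repository and fork.git plays your fork. These names, the branch name fix-typo and the file contents are all hypothetical. Forking and opening the pull request itself happen on the GitHub website and are not shown.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Someone else's project
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
echo "Manchester papr" > README.md
git add README.md
git commit -q -m "Add README"
cd ..
git clone -q --bare seed upstream.git        # the owner's repository
git clone -q --bare upstream.git fork.git    # what clicking "Fork" does, roughly

# Clone your fork, make a change on a new branch, and push it to the fork
git clone -q fork.git my_copy
cd my_copy
git checkout -q -b fix-typo
echo "Manchester paper" > README.md
git commit -q -am "Fix typo in README"
git push -q origin fix-typo

# The branch now exists in the fork but not in the owner's repository;
# a pull request asks the owner to bring it across
git --git-dir="$sandbox/fork.git" branch --list fix-typo
git --git-dir="$sandbox/upstream.git" branch --list fix-typo
```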

Advice for submitting Pull Requests

Keep your Pull Request small and focussed (makes it easier to process)
- Submit one PR per issue
- Create a separate branch for each issue you work on (you can submit a PR from any branch)
R.T.F.M.
- If the repository has contributing guidelines, read them, and follow the guidance. This gives your PR a better chance of being accepted.
- Some repositories pre-populate the body of the PR or issue message with a template.
  - Follow the instructions (e.g. provide the information requested)
Consider creating a new issue first to discuss your ideas before submitting a PR. Some repositories ask for this in their contributing guidelines, but it can be a good approach even when it isn’t required: you find out whether the owner agrees with your suggestion, and they might bring up ideas and/or challenges you haven’t considered.

After submitting your pull request

If things go well, your PR may get merged just as it is. However, for most PRs, you can expect some discussion (on GitHub) and a request for further edits to be made. Given your changes haven’t been merged yet, you can make changes either by adding further commits to your branch and pushing them, or you could consider rewriting your history neatly using an interactive rebase onto an earlier commit. In either case, your PR will update automatically once you have pushed your commits.

Exercise

Let’s look at the workflow and try to repeat it:
Fork this repository by clicking on the Fork button at the top of the page.
Clone the repository from YOUR GitHub account. When you run git remote -v you should get something like this:
 origin	https://github.com/YOUR_USERNAME/manchester-paper.git (fetch)
 origin	https://github.com/YOUR_USERNAME/manchester-paper.git (push)
cd into the directory you just cloned. Create a new branch, then make changes you want to contribute. Commit and push them back to your repository. You won’t be able to push back to the repository you forked from because you are not added as a contributor!

Go to your GitHub account and in the forked repository find a green button for creating Pull Requests. Click it and follow the instructions.

The owner of the original repository gets a notification that someone created a pull request - the request can be reviewed, commented on and merged in (or not) via GitHub.

Key Points

A fork is a git clone into your (GitHub) account

A pull request asks the owner of a repository to incorporate your changes

Git hints and tips

Overview

Teaching: 15 min
Exercises: 0 min

Questions

How can I find help?

How can I further customise Git to suit my preferences?

Objectives

Access help online and from command prompt

Configure Git to ignore certain types of files

Understand how to build a commit selectively using git add --patch

Getting help

`man` page

Like many Unix/Linux commands, git has a man page,

$ man git

You can scroll the manual page up and down using the up and down arrows.

You can search for keywords by typing / followed by the search term e.g. if interested in help, type /help and then hit enter.

To exit the manual page, type q.

Command-line help

Type,

$ git --help

and Git gives a list of commands it is able to help with, as well as their descriptions.

You can get more help on a specific command, by providing the command name e.g.

$ git init --help
$ git commit --help

Google

Search for your problem online. Someone has probably already asked (and answered) your question on stackoverflow.com.

Ignore scratch, temporary and binary files

You can create a .gitignore file which lists the patterns of files you want Git to ignore. It’s common practice to not add to a repository any file you can automatically create in some way e.g. C object files (.o), Java class (.class) files or temporary files e.g. XEmacs scratch files (~). Adding these to .gitignore means Git won’t complain about them being untracked.

Create or edit .gitignore,

$ gedit .gitignore

Then add patterns for the files you want to ignore, where * is a wildcard,

*~
*.o
*.so
*.dll
*.exe
*.class
*.jar

Then, add .gitignore to your repository,

$ git add .gitignore
$ git commit -m "Added rules to ignore vim scratch files and binary files"
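
If you are unsure whether a pattern is doing what you expect, git check-ignore can report which rule matches a given file. A short sketch in a throwaway repository (the file names are made up; note that each pattern sits on its own line in .gitignore):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo

# One pattern per line
printf '*~\n*.o\n*.class\n' > .gitignore
touch analysis.c analysis.o 'notes.txt~'

# Show which rule (file, line number and pattern) ignores each file
git check-ignore -v analysis.o 'notes.txt~'

# Only .gitignore and analysis.c are reported as untracked
git status --porcelain
```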

`git add --patch`

This is a way to stage only parts of a file. If you have done lots of work without committing, it may be useful to commit your changes as a series of small commits. This command allows you to choose which changes go into which commit so you can group the changes logically.

Guide to git add --patch
Manually editing hunks is the most difficult aspect.
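
git add --patch is interactive: it shows each changed region (a “hunk”) and asks whether to stage it. So that it can be run unattended, the sketch below pipes in the answers y (stage the first hunk) and n (skip the second). The file and its contents are invented, and the two edits are placed far enough apart that Git treats them as separate hunks.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
seq 1 20 > data.txt
git add data.txt
git commit -q -m "Add data"

# Make two unrelated edits: one on the first line, one on the last
sed -e '1s/.*/first line changed/' -e '20s/.*/last line changed/' data.txt > data.tmp
mv data.tmp data.txt

# Normally you would answer the prompts by hand
printf 'y\nn\n' | git add --patch data.txt

# Only the first edit is staged and committed; the second stays in the working copy
git commit -q -m "Change first line"
git diff --stat
```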

`git commit --author`

You can commit changes made by someone else, by using the --author flag. Consider how this may enable you to collaborate with your colleagues. The syntax is:

$ git commit --author="FirstName Surname <Firstname.Surname@example.com>"
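
Git records the author and the committer separately, which is what makes this work. A minimal sketch, with invented names and addresses, showing both fields for a commit made on a colleague’s behalf:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
# You (the committer) are configured as usual
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "Figure caption" > caption.md
git add caption.md
# Credit the change to a colleague
git commit -q -m "Add figure caption" --author="A Colleague <a.colleague@example.com>"

# The log keeps both identities
git log -1 --format='author:    %an <%ae>%ncommitter: %cn <%ce>'
```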

Colours in Git

On many computers, the terminal output is automatically coloured which makes reading the output easier. If your output is not coloured (e.g. in the Sackville/G11 cluster) there is a command which will add the colour (note the spelling of color):

$ git config --global --add color.ui true                       # Note US spelling of color

Add colour to `diff`

$ git config --global color.diff auto

Configure a visual diff tool

git diff is ok, but not very user friendly. It represents changes as removal of a line, followed by the addition of a new line. There are many diff GUIs available, which can be much easier to work with. To view differences with a GUI instead of using the command-line diff tool, first configure git to use your chosen diff tool:

$ git config --global diff.tool diffmerge    # Set diffmerge as your visual diff tool
$ git config --global difftool.prompt false  # Suppress confirmation before launching GUI

Then to use the GUI, use the following command instead of git diff:

$ git difftool

`git stash`

Sometimes you are working on one branch and want to switch to another branch for a while. In order to do so you would normally need to have a clean working directory i.e. no modified files or staged changes. You could commit all the changes you have made, then switch branch, but that would involve committing incomplete work just to return to this state later on. git stash records the dirty state of your working directory and saves it on a stack of unfinished changes that you can reapply at any time using git stash apply. See here for more details and for examples.
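
A quick sketch of the stash round trip in a throwaway repository (contents invented for illustration):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
echo "Introduction" > paper.md
git add paper.md
git commit -q -m "Write introduction"

# Start some work that isn't ready to commit
echo "Half-finished sentence" >> paper.md

# Put it aside; the working directory is clean again
git stash
clean=$(git status --porcelain)

# ... switch branch, do other work, switch back ...

# Bring the unfinished work back
git stash apply
git stash list
```

Note that git stash apply leaves the entry on the stack, which is why git stash list still shows it afterwards. git stash pop reapplies the changes and removes the entry in one go.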

Password manager/ssh-agent

You can have the ssh-agent manage the passphrase for your SSH key so you don’t have to keep entering it.

Start the ssh-agent

$ eval `ssh-agent`

Add your key(s) to the agent:

$ ssh-add

Git GUIs

There are a number of available GUIs for working with Git. The official Git page contains a comprehensive list.

However, Git for Windows already comes with all the tools you need (Git Bash, Git GUI, Shell integration).

Some IDEs already have integration with version control e.g. MATLAB, RStudio.

Git configuration

Git’s global configuration file, .gitconfig, is created automatically in your home directory. If you set up some basic configuration (in the first steps of this tutorial), it should look something like this.

$ cat ~/.gitconfig

[user]
	name = Your Name
	email = yourname@yourplace.org
[core]
	editor = gedit

You can add more configuration options. For example, instead of typing git commit -m we can have a shorter version of this command:

$ git config --global alias.cms 'commit -m'

And now our configuration file will have a new section added:

[alias]
	cms = commit -m

Next time we can simply type:

$ git cms "Commit message"
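
If you would like to try an alias without changing your global settings, you can leave out --global so that the alias is stored only in the current repository’s .git/config. A small sketch in a sandbox repository (file name and message invented):

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Demo User"
git config user.email "demo@example.com"

# Without --global the alias applies to this repository only
git config alias.cms 'commit -m'

echo "Notes" > notes.md
git add notes.md
git cms "Add notes"

# Check what the alias expands to
git config --get alias.cms
```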

Completely removing unwanted files from the repository

As we discussed earlier, there are a number of ways to undo what we did in Git. However, most of the time we actually want to amend something rather than discard everything completely. Often, undoing things in fact means creating a new commit rather than abandoning old ones. Since Git is a version control system, everything recorded in past commits remains available in the repository.

For example, if you accidentally committed a file with sensitive data (passwords) in your local repository and then pushed it to the remote repository, the file will still be there in the history even if you remove it (git rm) in the next commit and push.

This article provides a step-by-step tutorial on how to completely remove files from your repository (purge the repository) using git filter-branch.

Removing files from the repository may be useful not only when the files contain sensitive data. Another case is a large file committed to your local repository. By default, there are no limits on the size of files you can commit. However, there may be (and quite likely will be) limits on the size of files you can push to remote repositories (GitHub allows for max 100MB). You may end up in an annoying situation where you committed a large file locally and then kept working, making local commits without pushing. When you finally decide to push to GitHub (or another remote), the push is rejected because the file is too big. Using git rm won’t help, because the push includes every commit made since the last push, and one of those commits still contains the large file. To recover from this you will have to purge the large file from the repo (or switch to a different remote repo provider that allows large files).

Again, as always with Git, make sure you know what you’re doing before you execute the above!

Key Points

List files to ignore by committing a .gitignore file

Selectively stage changes to files using git add --patch

Configure a visual diff tool and use git difftool to view changes

Conclusions and further information

Overview

Teaching: 5 min
Exercises: 0 min

Questions

Where can I find out more?

Objectives

Reflect on how version control would help with the starting scenario

We’ve seen how we can use version control to:

Keep track of changes like a lab notebook for code and documents.
Roll back changes to any point in the history of changes to our files - “undo” and “redo” for files.
Back up our entire history of changes in various locations.
Work on our files from multiple locations.
Identify and resolve conflicts when the same file is edited within two repositories without losing any work.
Collaboratively work on code or documents or any other files.

Now, consider again our initial scenario:

If someone asks you, “Can you process a new data file in exactly the same way as described in your journal paper? Or can I have the code to do it myself?”, you can use your version control logs and tags to easily retrieve the exact version of the code that you used.

Version control serves as a log book for your software and documents, ideas you’ve explored, fixes you’ve made, refactorings you’ve done, false paths you’ve explored - what was changed, who by, when and why - with a powerful undo and redo feature!

It also allows you to work with others on a project, whether that be writing code or papers, down to the level of individual files, without the risk of overwriting and losing each other’s work, and being able to record and understand who changed what, when, and why.

Find out more…

Download and install Git on your own computer (it’s free!)
Atlassian Git tutorials — an excellent resource with clear explanations and illustrations
Learn Git branching — interactive, visual tutorials
K. Ram (2013) “git can facilitate greater reproducibility and increased transparency in science”, Source Code for Biology and Medicine 2013, 8:7 doi:10.1186/1751-0473-8-7 — survey of the range of ways in which version control can help research.
Visual Git Reference — pictorial representations of what Git commands do
Pro Git — the “official” online Git book.
Version control by example — an acclaimed online book on version control by Eric Sink.
Git beyond the basics — a nice reference slideshow covering some more advanced topics
G. Wilson, D. A. Aruliah, C. T. Brown, N. P. Chue Hong, M. Davis, R. T. Guy, S. H. D. Haddock, K. Huff, I. M. Mitchell, M. Plumbley, B. Waugh, E. P. White, P. Wilson (2012) “Best Practices for Scientific Computing”, arXiv:1210.0530 [cs.MS].

Feedback

Please leave some feedback. It’s good to know how things can be improved.

Key Points

Use version control whenever possible
