class: center, middle

# Version control and collaboration
# with `git` and GitHub

Jakub Kaczmarzyk ([@kaczmarj](https://github.com/kaczmarj))

Koo Lab

MD-PhD program at Stony Brook

May 12, 2021

---

# What is version control?

> Version control is a system that records changes to a file or set of files over time
> so that you can recall specific versions later.
>
> Source: [Pro Git Book](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)

---

# Naive version control

- Multiple files or directories
  - script.py
  - script_20210501.py
  - script_20210504.py

--

- Emailing a file to yourself

--

- Relying on multiple backups

--

- Trying to remember what we did

---
class: center, middle
---

# Why version control

- Stay organized

--

- Save snapshots of your work

--

- Experiment at will

--

- Enable reproducibility

--

- Quantify how much you've done

--

- Share and collaborate with the world

---

# Agenda

1. Introduction to version control with `git`
  - Create a repository
  - Common workflow
  - Branches
  - Merge conflicts
2. Introduction to GitHub
  - What is it?
  - Uploading your repository
  - Making a pull request
  - Collaborating

---

# `git` - _The stupid content tracker_

Open a terminal, and type this in the console.

```bash
git --help
```

I suggest typing out the commands yourself rather than copying and pasting. I find that this helps me remember the commands.

----

If an error like `command not found` appears, please install `git` from https://git-scm.com/download.

---

# `git config` | getting started

Let us introduce ourselves.

```bash
git config --global user.name "My Name"
git config --global user.email "me@email.com"
```

Our name and email address will be associated with all of our _commits_.

----

[_commit_](https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefcommitacommit):

> As a noun: A single point in the Git history; the entire history of a project is represented as a set of interrelated commits. The word "commit" is often used by Git in the same places other revision control systems use the words "revision" or "version". Also used as a short hand for commit object.
>
> As a verb: The action of storing a new snapshot of the project’s state in the Git history, by creating a new commit representing the current state of the index and advancing HEAD to point at the new commit.

---

# commits visualized

A commit is a checkpoint, and each circle represents one commit. The left-most circle is the initial commit. Each subsequent circle represents changes made in the content (e.g., code). Commits form a directed acyclic graph (DAG).

_Each commit stores a snapshot of the content and points to the commit(s) that came before it_, so the differences between any two commits can always be recovered.
[Figure: flowchart of commits drawn as circles, with the initial commit on the left]
One can view the state of the repository at any commit.

----

"Everything not saved will be lost." (Nintendo "quit screen" message)

---

# `git init` | create a repository

We use the command [`git init`](https://git-scm.com/docs/git-init) to initialize a repository.

> This command creates an empty Git repository - basically a `.git` directory with subdirectories for `objects` [an immutable unit of storage] [...]. An initial branch without any commits will be created.

```bash
mkdir git-tutorial
cd git-tutorial
git init
```

--

At this point, we have an empty Git repository. Look at the output of `git status` to see the current status of the repository.

```bash
git status
```

----

**`git status` is your friend.** If you are unsure about the state of your repository, ask `git status`. If you are confused, it will usually tell you what to do next.

---

# add content

We'll be making a cake. Create a file called `recipe.txt` with the following content:

```diff
Let's bake a cake!
```

--

And then look at the output of

```bash
git status
```

```diff
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	recipe.txt

nothing added to commit but untracked files present (use "git add" to track)
```

---

# `git add` | ask `git` to track files

Although it might seem so, `git` cannot read minds. It does not know about files until we explicitly tell it to track them. To do this, we use [`git add`](https://git-scm.com/docs/git-add).

```bash
git add recipe.txt
```

--

Again, look at the output of

```bash
git status
```

```diff
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   recipe.txt
```

---

# `git commit` | record changes

Now that `git` is aware of our file, we want to commit it to memory. To do this, we use [`git commit`](https://git-scm.com/docs/git-commit).

```bash
git commit --message "initial commit of recipe" recipe.txt
```

Every commit has a corresponding message. __Make it descriptive!__

--

Again, ask our old friend `git status` for the rundown.

```bash
git status
```

```bash
On branch main
nothing to commit, working tree clean
```

----

Ah, _working tree clean_.

---

## Another way to write a commit message

If we had omitted the `-m/--message` option, a text editor would open for us to write in. This is especially useful when writing multi-line commit messages.

```bash
git commit recipe.txt
```

To change the default editor (the example below uses `emacs`):

```bash
git config --global core.editor emacs
```

---

## As a DAG
This is the root commit, the ancestor of all future commits.

--

----

.center[_Don't get lazy._]

---

# Track a change

Add the line below to `recipe.txt`:

```diff
First, pre-heat the oven to 375 F.
```

`recipe.txt` should now look like this:

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
```

--

We visit our friend, quickly becoming family, `git status`:

```diff
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   recipe.txt

no changes added to commit (use "git add" and/or "git commit -a")
```

---

# View changes

Use `git diff` to see the differences.

```bash
git diff recipe.txt
```

--

```diff
diff --git a/recipe.txt b/recipe.txt
index 6d396e0..54c8255 100644
--- a/recipe.txt
+++ b/recipe.txt
@@ -1 +1,2 @@
 Let's bake a cake!
+First, pre-heat the oven to 375 F.
```

---

# Track a change - staging area

We use `git add` to add changes to the [staging area](https://git-scm.com/about/staging-area). The staging area allows us to review what we want to commit before we commit.

```bash
git add recipe.txt
```

Then check the status.

```bash
git status
```

---

# Track a change - commit to memory

Now that we are confident we want to commit, let us come up with a _descriptive_ message and commit the changes.

```bash
git commit -m "add instruction to preheat oven"
```

Notice that we did not add any filepaths to the command. Without filepaths, `git commit` commits everything in the staging area.

--

## Our commits as a DAG
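The same graph can also be printed in the terminal. The sketch below is an illustration and not part of the original walkthrough: it rebuilds the two recipe commits in a throwaway directory and asks `git log` to draw them. The temporary directory and the author identity are assumptions, added only so that the script runs on its own.

```shell
# Sketch: recreate the two recipe commits and print them as a graph.
set -e
repo=$(mktemp -d)                            # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Example Baker"         # hypothetical identity
git config user.email "baker@example.com"    # hypothetical identity

echo "Let's bake a cake!" > recipe.txt
git add recipe.txt
git commit -q -m "initial commit of recipe"

echo "First, pre-heat the oven to 375 F." >> recipe.txt
git add recipe.txt
git commit -q -m "add instruction to preheat oven"

# One line per commit, newest first; --graph draws the links between commits.
git log --graph --oneline

# Count the commits reachable from HEAD.
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
```

In the graph output, each `*` is one commit. The newest commit is printed first, so the root commit is the last line.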
---

# `git log` | show commit logs

To view metadata about recent commits, we can use

```bash
git log
```

```diff
commit d2926c141c71799d41ac821804e3ba9759ceabdd (HEAD -> main)
Author: John Doe
Date:   Wed May 12 10:17:44 2021 -0400

    add instruction to preheat oven

commit 509774206bc1fa4c39f9c4c55e4a46ffc50acf94
Author: John Doe
Date:   Wed May 12 10:07:29 2021 -0400

    initial commit of recipe
```

---

# Jakub's workflow

```bash
# Initialize the git repository and commit files initially.
git init
git add FILE...
git commit -m "initial commit"

# Make changes to files ...
git status  # See the current status of the repository
git diff  # See what has changed
git add FILE...  # Add files to the staging area
git status  # Inspect the staging area
# Commit changes to memory
git commit -m "a useful commit message here"

# Create a new file things.txt ...
git status  # See the current status
git add things.txt  # Tell git to track the file
git status  # Inspect the staging area
# Commit the file to memory.
git commit -m "useful message"

# And so on ...
```

---

# Branches
[Figure: branch diagram with a "Main branch" and a "Feature" branch]

---

# Branches

We want to work on a new feature, but it will take time and get messy. We want to preserve the sanctity of our main branch while working on this feature, so _we deviate_ from our main branch and create a new one. Later on, _we will merge_ these changes back into the main branch.

**Heuristic: once there is a working main branch, _all new content should be added through branches_.**

This heuristic leads to more organized code and a cleaner main branch.

[Figure: branch diagram with a "Main branch" and a "Feature" branch]
---

# `git checkout` | switch to a branch

To create a new branch and switch to it, use

```bash
git checkout -b add/ingredients
```

where `add/ingredients` is the name of the branch. We chose this name because we will be adding the ingredients for the recipe. See how we are describing the purpose of this feature concisely in the branch name?

--

Now if you run

```bash
git status
```

you will see that we are on the `add/ingredients` branch.

You can switch between branches using

```bash
git checkout BRANCH
```

where `BRANCH` is the name of the branch.

---

# Where we are

[Figure: commit graph with the label "Preheat"]
---

# Add a feature

Let's stay in the branch `add/ingredients`, and let's add this line to `recipe.txt`.

```diff
Second, get some eggs, flour, butter, and chocolate.
```

`recipe.txt` should now contain

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
Second, get some eggs, flour, butter, and chocolate.
```

--

## Commit the changes

```bash
git add recipe.txt
git commit -m "add instruction to get ingredients"
```

---

# Where we are

[Figure: commit graph with the labels "Ingredients" and "Preheat"]
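To check from the terminal how far the feature branch has moved ahead of main, one option is the range syntax `main..add/ingredients`. The sketch below is an illustration under assumptions: it recreates the history in a throwaway directory with a hypothetical author identity, and it collapses the first two commits into one to stay short.

```shell
# Sketch: list the commits that are on the feature branch but not on main.
set -e
repo=$(mktemp -d)                            # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Example Baker"         # hypothetical identity
git config user.email "baker@example.com"    # hypothetical identity

echo "Let's bake a cake!" > recipe.txt
echo "First, pre-heat the oven to 375 F." >> recipe.txt
git add recipe.txt
git commit -q -m "add instruction to preheat oven"
git branch -M main                           # make sure the branch is called main

git checkout -q -b add/ingredients
echo "Second, get some eggs, flour, butter, and chocolate." >> recipe.txt
git add recipe.txt
git commit -q -m "add instruction to get ingredients"

# A..B means: commits reachable from B that are not reachable from A.
git log --oneline main..add/ingredients

ahead_count=$(git rev-list --count main..add/ingredients)
behind_count=$(git rev-list --count add/ingredients..main)
echo "ahead of main by $ahead_count, behind main by $behind_count"
```

A branch that is ahead of main and not behind it can be merged without conflicts, because main has not changed since the branch was created.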
---

# `git merge` | merge the changes

We are satisfied with the state of the new feature. Now we want to merge these changes into the main branch.

```bash
git checkout main  # whatever the main branch is
git merge add/ingredients
```

--

There it is! We have merged our feature into the main branch. If you look at `recipe.txt`, you should see the following:

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
Second, get some eggs, flour, butter, and chocolate.
```

---

# Where we are

[Figure: commit graph with the labels "Ingredients", "Preheat", and "Recipe so far"]
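Because main did not change while we worked on `add/ingredients`, this merge only has to move main forward to the tip of the feature branch (a "fast-forward"). The sketch below illustrates how one might confirm that. It runs in a throwaway directory with a hypothetical author identity, and it passes `--ff-only`, which is not used in the slides, so that the script stops if a fast-forward is not possible.

```shell
# Sketch: merge the feature branch into main and confirm what happened.
set -e
repo=$(mktemp -d)                            # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Example Baker"         # hypothetical identity
git config user.email "baker@example.com"    # hypothetical identity

echo "Let's bake a cake!" > recipe.txt
echo "First, pre-heat the oven to 375 F." >> recipe.txt
git add recipe.txt
git commit -q -m "add instruction to preheat oven"
git branch -M main                           # make sure the branch is called main

git checkout -q -b add/ingredients
echo "Second, get some eggs, flour, butter, and chocolate." >> recipe.txt
git add recipe.txt
git commit -q -m "add instruction to get ingredients"

git checkout -q main
git merge -q --ff-only add/ingredients       # refuse anything but a fast-forward

# After a fast-forward, both branch names point at the same commit.
main_tip=$(git rev-parse main)
feature_tip=$(git rev-parse add/ingredients)

# Branches whose work is fully contained in main.
merged=$(git branch --merged main --format='%(refname:short)')
echo "$merged"
```

Once a branch shows up in `git branch --merged main`, all of its commits are part of main.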
---

# Merge conflicts

A merge conflict happens when changes conflict. For example, while working on your feature branch, someone might have modified some lines of code that you are also modifying. When it comes time to merge your branch, `git` does not know which changes to keep.

In the case of a merge conflict, you will have to tell `git` which changes should be kept.

In the code below, we will create a merge conflict by creating a feature branch and committing changes, while also creating another feature branch from main and merging changes to the same lines.

```bash
git checkout -b add/step3-flour
echo "Third, add flour." >> recipe.txt
git commit -m "add flour" recipe.txt
git checkout main  # checkout to main first, so we branch from main!
git checkout -b add/step4-eggs
echo "Fourth, add eggs." >> recipe.txt
git commit -m "add eggs" recipe.txt
git checkout main
git merge add/step4-eggs  # merge step4 before we merge step3
```

---

# Merge conflicts

Now let's try to merge the flour feature.

```bash
git merge add/step3-flour
```

--

Uh oh...

```diff
Auto-merging recipe.txt
CONFLICT (content): Merge conflict in recipe.txt
Automatic merge failed; fix conflicts and then commit the result.
```

The main issue is that the feature branch was not up to date with main.

---

# Merge conflicts

**Heuristic: always keep branches updated with main. Work out merge conflicts _before_ merging branches into main.**

Let's abort the merge and fix the conflict in the feature branch. We want a nice clean merge into the main branch.

```bash
git merge --abort
git checkout add/step3-flour
git merge main  # Merge main branch into feature branch
```

We get the same conflict, but this time we will take care of it here, in the feature branch...

---

# Merge conflicts - resolution

The contents of `recipe.txt` are now

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
Second, get some eggs, flour, butter, and chocolate.
<<<<<<< HEAD
Third, add flour.
=======
Fourth, add eggs.
>>>>>>> main
```

The goal here is to edit the content between the `<<<<<<<` and `>>>>>>>` markers to our satisfaction and to delete the marker lines themselves. We could keep step 3, keep step 4, keep both lines, or replace the lines with something entirely different.

In this case, let's keep both lines, with step 3 coming before step 4. The file should look like this now.

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
Second, get some eggs, flour, butter, and chocolate.
Third, add flour.
Fourth, add eggs.
```

---

# Merge conflicts - resolution

Our friend `git status` will help us. After we made our changes, it suggests using `git add`.

```bash
git add recipe.txt
git commit -m "merge main branch"
```

Notice this commit message is about merging. This commit represents the merging of main into the feature branch. In other words, we have updated our feature branch with the current state of main.

If we merge the feature branch into main now, there should be no problems.

```bash
git checkout main
git merge add/step3-flour
```

And `recipe.txt` looks like this

```diff
Let's bake a cake!
First, pre-heat the oven to 375 F.
Second, get some eggs, flour, butter, and chocolate.
Third, add flour.
Fourth, add eggs.
```

---

# Jakub's workflow - with branches

Assume we have a `git` repository.

```bash
git checkout main  # branch from main
git checkout -b BRANCH

# Make changes ...
git add FILE...
git commit -m "a useful commit message here"

# Periodically merge updates from the main branch.
# Work out any merge conflicts while in the feature branch.
git merge main

# Merge the changes into main.
git checkout main
git merge BRANCH
```

---

# Heuristics regarding branches

1. Once there is a working main branch, _all new content should be added through branches._
  - What does "working" mean? That is up to you. Perhaps it means that the repository has a file in it.
2. Always keep branches updated with main. _Work out merge conflicts before merging branches into main._

---

# Questions?
If you remember one thing, let it be

```bash
git status
```

Here is a [`git` cheatsheet from GitHub](https://training.github.com/downloads/github-git-cheat-sheet/).

---

# GitHub

A place to host and collaborate on code versioned with `git`.

If the code is open-source, chances are it is available on GitHub.

Also serves as a backup. If anything happens to your computer, the code is online.

---

# Upload your repository

Let's upload the code we have been diligently working on. First, we need to ask GitHub to create a new repository for us.

- Go to https://github.com/, click the plus sign in the upper-right, and select "New repository".
- Or go to https://github.com/new. You end up in the same place.

--

Second, we need to decide on a name for the repository. `trying-github` could work. Optionally add a short description, and leave everything else unselected.

Create the repository!

---

# Upload your repository

We want to push an existing repository from the command line. GitHub gives us instructions on how to "push" our code to this repository.

```bash
git remote add origin git@github.com:USERNAME/REPO.git
git branch -M main
git push -u origin main
```

The line `git branch -M main` renames the current branch (often `master`) to `main`. [See here for more information](https://github.com/github/renaming).

---

# `git remote`

[remote repository](https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefremotearemoterepository)

> A repository which is used to track the same project but resides somewhere else. To communicate with remotes, see fetch or push.

```bash
git remote add origin git@github.com:USERNAME/REPO.git
```

This adds a new `remote` called `origin`, and it specifies the URL. Your URL might look slightly different from mine, but that's fine.

Now we can interact with this remote. We can _push_ code to it and _fetch_ changes from it.

The remote does not have to be on GitHub, but for our purposes, it almost always will be.

---

# `git push` | push to the remote

```bash
git push REMOTE BRANCH
```

This pushes the commits on `BRANCH` to the remote named `REMOTE`.

---

# Pull requests

Remember that we work on new code in branches? We can push those branches to GitHub, and then we can create a pull request.

A pull request is a request to merge one's changes into a branch. Most often, it asks to merge a feature branch into the main branch.

In a pull request, users can discuss changes, request certain changes, and see the entire history of the branch.

---

# Collaboration - one remote

Let's create a new feature branch in our `git` repository.

```bash
git checkout main  # make sure we branch from main
git checkout -b fix/intro
```

--

Change the first line of `recipe.txt` to

```diff
Let's bake a delicious cake!
```

--

Commit the changes.

```bash
git add recipe.txt
git commit -m "inform people that the cake will be delicious"
```

--

Push the changes to the remote (i.e., the GitHub repository).

```bash
git push origin fix/intro
```

---

# Collaboration - one remote

Go to your GitHub repository.

You might see a yellow alert about a recently-pushed branch. You can click "Compare & pull request" on that alert, or you can go to "Pull requests".

We want to merge the branch `fix/intro` into `main`.

Then click "Create pull request".

---
class: center, middle
---

# Collaborating - multiple remotes

The workflow with multiple remotes is similar to the one with one remote, but of course we need to manage multiple remotes.

Common setup: A project is hosted in one repository on GitHub. Collaborators contribute by "forking" the repository, making changes within their forks, and making pull requests back to the main repository.

----

What the _fork?_

> A fork is a copy of a repository that you manage. Forks let you make changes to a project without affecting the original repository. You can fetch updates from or submit changes to the original repository with pull requests.

- [source](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-forks)

---

# Collaborating - multiple remotes

In the upper-right of any GitHub repository, there is a button to "Fork" that repository. This makes a copy of the code in your account, and you are the owner of that fork. You do not affect the original repository in any way by forking.

After we fork the repository, we clone the fork onto our computers so we can work on the code. There is a green button with the word "Code". Click that, and you are given a URL that you can use to clone the repository.

So first, fork the repository. Then clone it onto your computer.

```bash
git clone git@github.com:YOURUSERNAME/REPO.git
```

This will create a directory named after the repository (`trying-github` in our example) with the code.

---

# Collaborating - multiple remotes

Remember that we make changes in branches? And remember those branches should be up-to-date with the main branch?

The same applies with forks. We want to keep our forks updated with the original repository.

The original repository (the one we want to contribute to) is often called the "upstream" repository. We need to point `git` to the upstream.

```bash
git remote add upstream https://github.com/ORIGINALUSER/REPO.git
```

----

The general usage is

```bash
git remote add NAME_OF_REMOTE URL_OF_REMOTE
```

---

# Collaborating - multiple remotes

Now we want to download information about the upstream with

```bash
git fetch upstream
```

And every time we want to update our branch with `upstream`, we use

```bash
git merge upstream/main  # or upstream/master, depending on the upstream's default branch
```

You might run into merge conflicts, but now you are ready to tackle them!

---

# Collaborating - multiple remotes

Work on new code in branches, and push the changes to your fork (the remote is `origin`).

Then you can submit a pull request to the original repository. This asks the owner of the original repository to make your changes part of the original repository.

---

# Version control for data?

Git is not designed for large data files. You can use it for data, but it's not the right tool.

Here are a few tools designed to version large files:

- [git-annex](https://git-annex.branchable.com/)
- [git-lfs](https://git-lfs.github.com/) (large file storage)
- [DataLad](https://www.datalad.org/)

---
class: center, middle

# Thank you!