Chapter 15 Command-line, git37

15.1 Where are we? Where are we headed?

Up till now, you should have covered:

  • Statistical Programming in R

In conjunction with the markdown/LaTeX chapter, which is mostly used for typesetting and presentation, here we'll introduce the command-line and git, more used for software extensions and version control

15.2 Check your understanding

Check if you have an idea of how you might code the following tasks:

  • What is a GUI?
  • What do the following commands stand for in shell: ls (or dir in Windows), cd, rm, mv (or move in windows), cp (or copy in Windows).
  • What is the difference between a relative path and an absolute path?
  • What paths do these refer to in shell/terminal: ~/, ., ..
  • What is a repository in github?
  • What does it mean to "clone" a repository?

15.3 command-line

Elementary programming operations are done on the command-line, or by entering commands into your computer. This is different from a UI or GUI -- graphical user-interface -- which are interfaces that allow you to click buttons and enter commands in more readable form. Although there are good enough GUIs for most of your needs, you still might need to go under the hood sometimes and run a command.

15.3.1 command-line commands

Open up Terminal in a Mac. (Command Prompt in Windows)

Running this command in a Mac (dir in Windows) should show you a list of all files in the directory that you are currently in.

ls
## 01_warmup.Rmd
## 02_functions.Rmd
## 03_limits.Rmd
## 04_calculus.Rmd
## 05_optimization.Rmd
## 06_probability.Rmd
## 07_linear-algebra.Rmd
## 11_data-handling_counting.Rmd
## 12_matricies-manipulation.Rmd
## 13_functions_obj_loops.Rmd
## 14_visualization.Rmd
## 15_project-dempeace.Rmd
## 16_simulation.Rmd
## 17_non-wysiwyg.Rmd
## 18_text.Rmd
## 19_command-line_git.Rmd
## 21_solutions-warmup.Rmd
## 23_solution_programming.Rmd
## _book
## _bookdown.yml
## _build.sh
## CODE_OF_CONDUCT.md
## CONTRIBUTING.md
## data
## _deploy.sh
## DESCRIPTION
## images
## index.Rmd
## LICENSE
## _output.yml
## preamble.tex
## prefresher_files
## prefresher.log
## prefresher.Rmd
## prefresher.Rproj
## README.md
## R_exercises
## style.css

pwd stands for present working directory (cd in Windows)

pwd
## /home/travis/build/IQSS/prefresher

cd means change directory. You need to give it what to change your current directory to. You can specify a name of another directory in your directory.

Or you can go up to your parent directory. The syntax for that are two periods, .. . One period . refers to the current directory.

cd ..
pwd
## /home/travis/build/IQSS

~/ stands for your home directory defined by your computer.

cd ~/
ls
## apt-get-update.log
## bin
## build
## builds
## build.sh
## filter.rb
## gopath
## otp
## perl5
## R-bin
## texlive
## virtualenv

Using .. and . are "relative" to where you are currently at. So are things like figures/figure1.pdf, which is implicitly writing ./figures/figure1.pdf. These are called relative paths. In contrast, /Users/shirokuriwaki/project1/figures/figure1.pdf is an "absolute" path because it does not start from your current directory.

Relative paths are nice if you have a shared Dropbox, for example, and I had /Users/shirokuriwaki/mathcamp but Connor's path to the same folder is /Users/connorjerzak/mathcamp. To run the same code in mathcamp, we should be using relative paths that start from "mathcamp". Relative paths are also shorter, and they are invariant to higher-level changes in your computer.

15.3.2 running things via command-line

Suppose you have a simple Rscript, call it hello_world.R. This is simply a plain text file that contains

cat("Hello World")

Then in command-line, go to the directory that contains hello_world.R and enter

Rscript hello_world.R

This should give you the output Hello World, which verifies that you "executed" the file with R via the command-line.

15.3.3 why do command-line?

If you know exactly what you want to do your files and the changes are local, then command-line might be faster and be more sensible than navigating yourself through a GUI. For example, what if you wanted a single command that will run 10 R scripts successively at once (as Gentzkow and Shapiro suggest you should do in your research)? It is tedious to run each of your scripts on Rstudio, especially if running some take more than a few minutes. Instead you could write a "batch" script that you can run on the terminal,

Rscript 01_read_data.R
Rscript 02_merge_data.R
Rscript 03_run_regressions.R
Rscript 04_make_graphs.R
Rscript 05_maketable.R

Then run this single file, call it run_all_Rscripts.sh, on your terminal as

sh run_all_Rscripts.sh

On the other hand, command-line prompts may require more keystrokes, and is also less intuitive than a good GUI. It can also be dangerous for beginners, because it can allow you to make large irreversible changes inadvertently. For example, removing a file (rm) has no "Undo" feature.

15.4 git

Git is a tool for version control. It comes pre-installed on Macs, you will probably need to install it yourself on Windows.

15.4.1 why version control?

All version control software should be built to

  • preserve all snapshots of your work
  • and catalog them in such a way that you can refer back or even revert back your files to the past snapshot.
  • makes it easy to see exactly which parts of your files you changed between directories.

Further, git is most commonly used for collaborative work.

  • maintains "branches", or parallel universes of your files that people can switch back and forth on, doing version control on each one
  • makes it easy to "merge" a sub-branch to a master branch when it is ready.

Note that Dropbox is useful for collaborative work too. But the added value of git's branches is that people can make different changes simultaneously on their computers and merge them to the master branch later. In Dropbox, there is only one copy of each thing so simultaneous editing is not possible.

15.4.2 open-source code at your fingertips

Some links to check out:

GitHub https://github.com is the GUI to git. Making an account there is free. Making an account will allow you to be a part of the collaborative programming community. It will also allow you to "fork" other people's "repositories". "Forking" is making your own copy of the project that forks off from the master project at a point in time. A "repository" is simply the name of your main project directory.

"cloning" someone else's repository is similar to forking -- it gives you your own copy.

15.4.3 commands in git

As you might have noticed from all the quoted terms, git uses a lot of its own terms that are not intuitive and hard to remember at first. The nuts and bolts of maintaining your version control further requires "adding", "committing", and "push"ing, sometimes "pull"ing.

The tutorial https://try.github.io/ is quite good. You'd want to have familiarity with command-line to fully understand this and use it in your work.

RStudio Projects has a great git GUI as well.

15.4.4 is git worth it?

While git is a powerful tool, you may choose to not use it for everything because

  • git is mainly for code, not data. It has a fairly strict limit on the size of your dataset that you cover.
  • your collaborators might want to work with Dropbox
  • unless you get a paid account, all your repositories will be public.

  1. Module originally written by Shiro Kuriwaki