Robust workflow for replicability and reliability with Eclipse + StatET
SPSS and Office are easy to learn but lack the power and extensibility
of R, SQL, LaTeX/Sweave, BibTeX, and Subversion (SVN). These open-source
technologies were developed based on peer-reviewed code and are designed
to facilitate replicability, but they do not just work 'out of the box.'
This page contains instructions for how to configure these technologies
and set up a workflow for data management, analysis, and
word-processing/typesetting using these tools. Though I pay special
attention to Mac OS, PC/Linux users should be able to implement
everything below.
The steps are roughly as follows:
1. Install R.
2. Install LaTeX (long download).
3. Install Eclipse.
4. Install Eclipse plug-in StatET and configure.
5. Configure Sweave and pgfSweave.
6. Install Subversion (SVN) and Subclipse.
7. Install Zotero and configure it to make it play well with BibTeX.
This should take about an hour or so, depending on your Internet
connection, distractions, level of familiarity with these kinds of
tools, etc.
R for Data Analysis
First, install R. If you don't already know R, learn it as soon as you
can. There are some excellent links to introductory videos and texts in
the panel to the right.
Install R from this site:
http://cran.r-project.org/
One common complaint about R is that it is RAM-hungry. This is true,
but you can easily pair R with a SQL database such as SQLite, using the
SQLiteDF package, to augment its data processing capabilities. Another
solid option is the R package bigmemory, which is designed to help users
analyze datasets larger than 10 gigabytes. See the bigmemory vignette
for details.
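As a minimal sketch of the database approach: the example below assumes
the DBI and RSQLite packages are installed and uses them in place of
SQLiteDF (a swap made here in case SQLiteDF is not available for your
version of R). The built-in mtcars data stands in for a much larger
table, and the table name "cars" is only illustrative.

```r
# Assumes: install.packages(c("DBI", "RSQLite"))
library(DBI)

# Keep the data in an on-disk SQLite file rather than in RAM.
db_file <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_file)

# mtcars stands in for a table too large to hold comfortably in memory.
dbWriteTable(con, "cars", mtcars)

# Let SQLite do the aggregation; only the small result comes back into R.
avg_mpg <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS mean_mpg FROM cars GROUP BY cyl ORDER BY cyl"
)
print(avg_mpg)

dbDisconnect(con)
```

The point of the design is that R only ever holds the query result, so
the size of the full table is limited by disk space rather than RAM.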
If you do large-scale text analysis, check out the tm package, which is
set up to automatically database term-document matrices and offers
parallelization via Rmpi. It also plays well with topicmodels, which
implements Latent Dirichlet Allocation (LDA) and correlated topic models
(CTM).
LaTeX (pronounced 'Lay-Tek') for beautiful typesetting
Get a TeX distribution from one of the following sites:
Mac:
http://www.tug.org/mactex/
Windows:
http://miktex.org/
Eclipse, the editor (for just about everything)
Eclipse is a cross-platform open-source editor based on Java and
originally developed by IBM in 2001. It is an integrated development
environment (IDE), meaning that it provides a source code editor, a
compiler or (more critically for our purposes) an interpreter, build
automation tools, and a debugger (though debugging must still be done in
R and LaTeX via the command line). Eclipse also features an extensible
plug-in system. Among other things, it was designed with visual
programming in mind, meaning that it assists programming tasks by
representing code elements visually, and sometimes lets programmers
manipulate program elements graphically instead of merely via text
(code).
Eclipse is one of the most widely used editors/IDEs in existence. One
survey suggests that Eclipse is the third most-used IDE (behind
MS Visual Studio and Adobe Macromedia Studio). It is easily the most
widely used open-source IDE, and some predict that because of its rapid
growth and development it will eventually rise to the number-one spot.
What this means is that the codebase is robust and stable.
Why not just use the default R editor? Well, in addition to Sweave
functionality, Eclipse provides a connection to R with shortcut keys, R
and Sweave syntax highlighting, hover functionality, an outstanding
graphical object browser, and an outline that links to new objects and
functions declared in your code. It almost never crashes, so when (not
if) R crashes, you don't lose your R code. It also has a
find-and-replace function with RegEx support, toggle code commenting
(command-shift-C or control-shift-C), content assist, and other very
nice features. Though it is a resource-intensive IDE, it is easier to
learn than Emacs or Aquamacs.
Install the latest version of Eclipse, which we will also use as an
R/LaTeX/Sweave editor:
http://www.eclipse.org/downloads/
(you can download any version; I use the Eclipse IDE for Java
Developers).
Also, you may wish to edit the configuration file to speed up Eclipse.
"Local history": Robust local version control within Eclipse
One thing I recommend immediately upon downloading Eclipse is to set
its local history size to unlimited. This means that Eclipse tracks
your changes to each file in your project, each time you save it.
Go to Preferences => General => Workspace => Local History and
un-check "Limit history size."
If you need to compare your current document to a previous version,
simply right-click (or Ctrl-click) on the file and select "Compare
with" => "Local history..." Each saved revision of the document will
then appear in a side window. Double-click on a revision to compare it
with the current document. This works great when you accidentally save
over a file you were previously using (e.g., your dissertation or
journal manuscript), because you can simply use compare with local
history to restore all of your previous hard work.
Also, if you deleted a file by accident, you can right click on the
project folder and select "Restore from Local History..." and restore
your deleted file.
The StatET plugin for Eclipse
You'll want to install the StatET plugin for R:
http://www.walware.de/goto/statet
Be sure to install the rJava and rj packages for R:
From R, type
install.packages("rJava")
install.packages("rj", repos="http://download.walware.de/rj-0.5")
Install StatET from Eclipse by selecting Help ==> Install New
Software... and pasting the appropriate URL into the "work with:"
prompt:
http://download.walware.de/eclipse-3.7
[or your version of Eclipse]
Once you install StatET, you'll want to run the cheat sheets to
optimize your Eclipse configuration. From Eclipse, go to Help => Cheat
Sheets... and click on the StatET folder. Run the cheat sheets in
order.
Luke Miller has a nice step-by-step walkthrough of this setup on his
blog.
Longhow Lam has a short book on the StatET plug-in for Eclipse.
Basics and Shortcuts
Now that you've run through all of the cheat sheets, StatET should be
the default perspective and the R console should run
automatically. You might want to start by making a new R
project. If you want to import an existing project, just make a
new project with the same folder name, and copy the folder into your
workspace.
You can send R code that you write to the R console by pressing
'command+R, command+R' - that's command+R twice.
A full list of shortcut keys is available by pressing
'command+shift+L, command+shift+L'.
I recommend changing the assignment (' <- ') shortcut to
'command + shift + , ' or something similar and changing the add docu
comment (' ## ') shortcut to something like 'command + / '. [The
default shortcut combinations seem not to map to any existing keys on
my keyboard...].
One more useful shortcut key: content assist: 'ctrl+SPACE'
Writing LaTeX documents in Eclipse with Texlipse
StatET also installs Texlipse, which is a rather nice way to compose
LaTeX documents, especially if you are collaborating with others.
You'll need to set up a LaTeX project first, which can be done via
File, New.... Each time you save the .tex file in the project,
Eclipse will compile the LaTeX file for you. If you want to
change the name of the output file, right-click on the project folder,
select Properties, then select Latex Project Properties from the menu
on the left.
Once you have a LaTeX project going, you'll want to tweak some of the
settings to get the most out of Eclipse. Here's how:
Let's take care of spelling first.
ENABLE SPELL CHECKING for LaTeX:
Open Eclipse Preferences, select General, Editors, Text Editors,
Spelling, and make sure spelling is enabled. You'll also need to
specify a dictionary. I use this one:
Dictionary.txt
You can just download this dictionary into some directory (e.g., the
main Eclipse directory), then point to it in the "User defined
dictionary" dialogue box.
Now from the Preferences window, go to Texlipse, Spell Checker, and
make sure the built-in spell-checker is selected.
You can now press 'Command + 1' (or 'Ctrl+1' on a PC) on a word with a
squiggly red line under it to pull up a menu to either correct the word
or add it to your dictionary.
Now LINE-WRAPPING. You can either hard-wrap or soft-wrap your lines.
For hard-wrapping, which I recommend if you are collaborating with
others (it will be easier to see edits when you are comparing changes
in svn), from Preferences go to Texlipse, Editor and select 'Use hard
wrapping.' Eclipse will insert a return character when you reach the
number of characters you specify in the 'Number of characters in line
(10-1000)' dialogue box.
If you edit something and the line wrapping gets screwed up, just press
Esc Q or select Latex > Correct Line Wrap, and Eclipse will make
everything pretty for you again.
For soft-wrapping, I recommend the Ahtik plug-in.
In Eclipse, go to Help ==> Install New Software... and use the
following as the URL in the "work with:" prompt:
http://ahtik.com/eclipse-update/
Easy TABLE CREATION is facilitated by Eclipse's 'LaTeX Table
View.' Make sure it is visible first. Click on Window, Open
Perspective, and select Other. Select 'LaTeX' from the menu and
hit OK. If you do not see LaTeX Table View on some tab within
Eclipse, go to Window, Show View, and select LaTeX Table View.
You can now create tables in this spreadsheet-like editor, right-click
and select 'Export to clipboard,' and it will format your table with
&'s and nice spacing. If you want to edit a table you've
produced previously or in R (say via xtable()), right-click on the
LaTeX Table View editor and select 'Import selected lines from
editor.' When you are done, just export to clipboard as previously
described.
Sweave & pgfSweave
Sweave lets you do your data analysis and writing all in the same
place. I prefer pgfSweave, which caches your analysis and graphics, so
that you are not running that 30-minute Bayesian estimation procedure
each time you want to make a small cosmetic change to your document and
see the results in pdf. It also matches the font in R graphics to
whatever you are working with in your document, and provides R syntax
highlighting for any R code in your document - see the pgfSweave
vignette for details. First install pgfSweave in the R terminal:
install.packages("pgfSweave")
To get pgfSweave working in Eclipse, go to Run => External Tools
==> External Tools Configurations, and double-click on 'Sweave
Document Processing'. Where it says 'Run command in active R
Console:' replace the existing Sweave text with the following:
library(pgfSweave)
pgfSweave(file = "${resource_loc:${source_file_path}}")
Now is also a good time to make sure that your Sweave/pgfSweave build
tools are set up correctly, by navigating to the LaTeX tab and clicking
on the blue underlined link that reads "Setup build tools..." If
the various tools have directories listed, you're golden. If not,
type the location of your TeX distribution where it reads "Bin
directory of TeX distribution:" which on a Mac should be as follows:
/usr/local/texlive/[current tex year]/bin/universal-darwin/
[e.g.] /usr/local/texlive/2010/bin/universal-darwin/
You may also need to add this directory to your system path (I had
to). To do so (on OS 10.6 Snow Leopard) open the terminal and type the
following:
sudo pico /etc/paths
Then add the directory above as a new line in the pico editor and save
the file.
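To check whether the path change took effect, one option is to ask R
which TeX programs it can see. This is only a sketch using base R; the
two program names checked (pdflatex and bibtex) are my choice of
examples, and the commented-out line reuses the example TeX directory
from above.

```r
# Ask R which TeX executables it can find on the current search path.
tex_tools <- Sys.which(c("pdflatex", "bibtex"))
print(tex_tools)

# Sys.which() returns "" for any program it cannot find.
missing_tools <- names(tex_tools)[!nzchar(tex_tools)]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}

# For the current R session only, the TeX bin directory can be appended
# by hand (adjust the year to match your TeX distribution):
# Sys.setenv(PATH = paste(Sys.getenv("PATH"),
#                         "/usr/local/texlive/2010/bin/universal-darwin",
#                         sep = ":"))
```

If both entries print a full path, R (and therefore Sweave) should be
able to call LaTeX without further configuration.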
You may also want to tell Mac's Finder to display all files. To do so,
enter the following in a terminal window:
defaults write com.apple.Finder AppleShowAllFiles YES
You have to restart Finder for it to work. Hold option, click and hold
the Finder icon, and select Relaunch. (Thanks to Sean Westwood for this
hint).
To use pgfSweave, you'll want to add the following lines to any Sweave
(.Rnw) document:
\usepackage{Sweave}
\usepackage{tikz}
\usepackage{pgf}
THE SWEAVE.STY ISSUE - the most frustrating Sweave issue when you are
just starting out - Sweave/LaTeX for no good reason cannot find the
'Sweave.sty' LaTeX package (the file is just in your
~/R/R-x.x.x/share/texmf directory). Do yourself a favor and just
download this copy of Sweave.sty and put it in your TeX distribution
folder or the current directory you're working in:
Sweave.sty
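If you would rather use the copy of Sweave.sty that came with your own
R installation, something like the following may work. It is a sketch
that assumes a recent version of R, where the file sits under
share/texmf/tex/latex; the target directory here is a temporary folder
and should be replaced with your project directory.

```r
# Sweave.sty ships with R; R.home("share") points at R's share directory.
# The tex/latex subfolders are an assumption that holds for recent R versions.
sty_path <- file.path(R.home("share"), "texmf", "tex", "latex", "Sweave.sty")
print(sty_path)

# Replace tempdir() with the folder that holds your .Rnw file.
target_dir <- tempdir()

if (file.exists(sty_path)) {
  copied <- file.copy(sty_path, file.path(target_dir, "Sweave.sty"),
                      overwrite = TRUE)
} else {
  copied <- FALSE
  message("Sweave.sty not found at the expected location: ", sty_path)
}
```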
You can also use R itself to call LaTeX, so that R generates your pdf
directly. To do this, go to Run => External Tools ==> External Tools
Configurations, double-click on 'Sweave Document Processing', click on
the LaTeX tab, then select the option for "Build tex file using the R
command:". The R command should be filled in for you.
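To see roughly what happens behind that menu option, here is a small
sketch in plain R. It is not the command StatET fills in; it simply
writes a tiny, made-up Sweave file (minimal.Rnw) to a temporary folder,
runs Sweave() on it, and leaves the final LaTeX step commented out
because that step needs a working TeX installation.

```r
# Write a minimal Sweave document to a temporary folder.
work_dir <- tempfile("sweave-demo")
dir.create(work_dir)
rnw_file <- file.path(work_dir, "minimal.Rnw")

writeLines(c(
  "\\documentclass{article}",
  "\\usepackage{Sweave}",
  "\\begin{document}",
  "The mean of 1 to 10 is \\Sexpr{mean(1:10)}.",
  "<<summary-chunk, echo=TRUE>>=",
  "summary(cars$speed)",
  "@",
  "\\end{document}"
), rnw_file)

# Sweave() runs the R chunks and writes minimal.tex into the working directory.
old_wd <- setwd(work_dir)
Sweave(rnw_file)
setwd(old_wd)

tex_path <- file.path(work_dir, "minimal.tex")

# Turning the .tex file into a pdf additionally needs LaTeX on your path:
# tools::texi2pdf(tex_path)
```

Running the two steps by hand like this can help when a build fails in
Eclipse, since it shows whether the problem is in the R chunks or in the
LaTeX compilation.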
ENABLE SPELL CHECKING in Sweave/pgfSweave:
Open Eclipse Preferences, select StatET, Source Editors, Sweave Editor
and click on "Enable spell checking." Where it says "Note: On the
Spelling preference page..." click on Spelling. This will take
you to the main spelling preferences page. You'll need to specify a
dictionary. I use this dictionary:
Dictionary.txt
You can just download this dictionary into some directory (e.g., the
main Eclipse directory), then point to it in the spelling preferences
page.
ENABLE LINE WRAPPING in Sweave/pgfSweave:
For now the easiest way to enable line wrapping is by installing the
Ahtik plug-in.
In Eclipse, go to Help ==> Install New Software... and use the
following as the URL in the "work with:" prompt:
http://ahtik.com/eclipse-update/
Examples - Sweave & pgfSweave
Often the best way to learn something like this is by looking at
examples. Below I've posted example code for a few projects in
Sweave and pgfSweave:
APSAPoster5.Rnw - (Bias in the Flesh Poster (PDF), presented at the
American Political Science Association 2009).
BIAS-AMP-ICA5.Rnw - (Bias in the Flesh writeup (PDF), presented at the
International Communication Association 2010, DO NOT CITE THE .RNW
FILE).
pgfSweaveEXAMPLE.Rnw - (prelim phase of an NLP project (PDF), DO NOT
CITE).
Subclipse (easy to use Subversion interface in Eclipse)
Subversion is version control software, great for collaborating on
large projects. In short, it keeps track of every change made to a
series of files and is generally an excellent way to prevent data loss
and track changes between documents.
First, you need Subversion 1.6 with Java bindings if you want to use it
with Eclipse:
MAC: go to CollabNet and download Subversion 1.6 for OS *10.6* (second
option), which has the right Java bindings. You'll have to sign up for
access, but it's quick and painless.
http://www.open.collab.net/downloads/community/
PC: TortoiseSVN may be best for Windows. Install guide here:
http://www.woodwardweb.com/java/howto_configure.html.
I recommend Subclipse, which seems to play well with StatET (a
colleague reported not being able to use some SVN plugin for Eclipse
with StatET installed). From Eclipse, install new software, using
this as the repository:
http://subclipse.tigris.org/update_1.6.x.
Go to Preferences, Team, SVN, and change the SVN interface to 'SVNKit
(Pure Java) ... ' as there appear to be problems with the default
JavaHL (JNI) interface.
Your institution should have instructions to set up Subversion; for
example, Stanford has its own SVN setup instructions. You can also set
up Subversion locally if you want a nice version-control tool.
If your project is open source, Google Code hosts Subversion
repositories for free (up to 4 gigs), and as with most things Google,
it is easy to set up and use. You can use Google Code's svn with
Eclipse/Subclipse.
Using Google Code's svn means that your project is publicly available
(in read-only form) via svn. However, by hiding the "source" tab
in your project's web view, you can remove references to the svn's url
from your Google Code website.
To connect to an existing repository, go to File, New, Other, and
expand the SVN options folder, then select 'Checkout Projects from
SVN.' Select 'Create a new repository' (this creates a new
repository setting locally, not on the server). Then enter the
svn url. If you connect via ssh, which is often required for
institutions, a typical svn url is formatted as follows:
svn+ssh://username@domain.name.edu/folders/svn/project
There are three main svn commands you'll need: synchronize, update, and
commit. Generally, you'll want to synchronize your project before
starting work on it to make sure you have the most recent version, and
after working on it to make sure that your changes are saved to the
server and available to your collaborators. Right-click on a
project, select Team, then Synchronize with Repository.
Synchronize will open a new view in Eclipse that will show the files
that differ between your local working copy and the server. You can
examine the specific differences between documents by right-clicking on
any document that has been changed and selecting 'Open in Compare
Editor'. If you are ok with the changes, right-click on the
project and select Update.
Or, if you prefer, you can simply update your local version with the
version on the server without looking at the changes. To do so,
simply right-click on the project and select Team, Update to HEAD.
When you are finished working with a project, go back to the Team menu
(right-click on the project, select Team). You can either
Synchronize with Repository again, or you can just select Commit.
BibTeX for bibliography management
BibTeX is LaTeX's bibliography manager. The idea is that you
provide something like:
According to \citet{messing2009Bias}, the McCain campaign used
increasingly ``photoshopped'' images of Obama in their ads as election
day approached.
And your pdf output looks like:
"According to Messing (2009), the McCain campaign used
increasingly "photoshopped" images of Obama in their ads as election
day approached."
And this reference is automatically inserted into your bibliography (in
the correct order and format):
Messing, S., Plaut, E., & Jabon, M. (2009). Bias in the Flesh:
Attack Ads in the 2008 Presidential Campaign. In
Proceedings of the 2009 American Political
Science Association Annual Meeting.
Your references are all stored in a database and are programmatically
retrieved behind the scenes by LaTeX. To set BibTeX up with LaTeX if
you use APA format, first download this .bst file:
apaish.bst
Also, you'll want to add the following lines to your LaTeX preamble,
which tell LaTeX to load the natbib and booktabs packages:
\usepackage{natbib}
\usepackage{booktabs}
You'll also want to add the following lines where you want your
references to be located, e.g., just before \end{document}:
\bibliographystyle{/Users/user01/documents/workspace/apaish}
\bibliography{/Users/user01/documents/workspace/OpenResponse/MyLibrary}
Replace the paths to these files with the appropriate paths on your
machine. Note that you do not need to give the file extensions:
apaish.bst and MyLibrary.bib are specified without them.
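If you have never seen what an entry in a .bib file looks like, R can
print one for you. The sketch below uses base R's bibentry() and
toBibtex() with the reference shown above; the entry type
(InProceedings) is my assumption about how this paper would be
classified, and the write-to-file line is commented out so nothing is
overwritten.

```r
# Build the example reference as a bibentry object (base R, utils package).
entry <- bibentry(
  bibtype   = "InProceedings",  # assumed entry type for a conference paper
  key       = "messing2009Bias",
  author    = c(person("S.", "Messing"),
                person("E.", "Plaut"),
                person("M.", "Jabon")),
  title     = "Bias in the Flesh: Attack Ads in the 2008 Presidential Campaign",
  booktitle = "Proceedings of the 2009 American Political Science Association Annual Meeting",
  year      = "2009"
)

# Convert it to BibTeX format; this is the kind of text stored in MyLibrary.bib.
bib_lines <- toBibtex(entry)
print(bib_lines)

# To append it to a .bib file in your project:
# cat(bib_lines, file = "MyLibrary.bib", sep = "\n", append = TRUE)
```

The key on the first line of the printed entry is what you pass to
\citet{} or \citep{} in your document.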
Zotero + BibTeX to take the pain out of your lit review
Zotero is a Firefox plug-in that makes building a bibliography database
much less painful than it used to be. If you are browsing and see
an article you want to cite on a journal website, Google Scholar,
Amazon.com, etc., you simply hit a button in your browser's address bar
and Zotero imports the citation into your citation database.
Note that you'll want to check citations each time you import them to
make sure they have all the relevant entries you need. I
recommend establishing a folder for each project you work on to keep
things straight.
You can also share bibliographies with your collaborators by making a
group library. For example, the Stanford Comm Department has a group
library so that we can collaboratively build on each other's citation
databases.
Though Zotero has a plug-in for MS Word and OpenOffice, LaTeX
documents look much nicer and are less fragile than MS Word and
OpenOffice documents (I've had MS Word drop all of my citations for
reasons that I cannot figure out, and then had to manually re-enter
them via the graphical user interface).
One really nice thing about Zotero is that you can configure it so that
if you drag and drop a bibliography entry into your LaTeX document,
Zotero will automatically generate the citation for you. Go to
Zotero Preferences. Open the Advanced pane. Click on "Show Data
Directory." This will take you to a "zotero" folder. The "zotero"
folder will contain a "translators" folder. You should be in this
directory:
~/.mozilla/XXXXXXX/zotero/translators/
where XXXXXXX is some random string.
Download this BibtexCiteKeyOnly.js file and save it to that directory.
(This is a modified version of a bit of code posted to an Ubuntu Forum
by 'MartinSzyska').
While you're at it, download this BibTeX.js file into the same
directory, which will make sure that your in-document LaTeX references
(e.g., \citep{messing2009Bias}) match your bibliography key entries
when you export the full bibliography. [UPDATED FOR ZOTERO 2.11].
[ Or if you'd rather modify the javascript yourself or prefer a
different reference id naming scheme, you can open "BibTeX.js" in a
text editor like Notepad++ or Xcode. The line to change is:
var citeKeyFormat = "%a_%t_%y";
For example, I changed it to
var citeKeyFormat = "%a%y%t";
where %a is first author, %t is first word from title, %y is the year. ]
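To make the effect of the two format strings concrete, here is a rough
illustration in R. This is not the translator's code: make_cite_key is
a hypothetical helper written for this page, and it ignores details the
real BibTeX.js may handle (stop words, accents, duplicate keys). It only
shows how the %a, %y and %t placeholders are filled in.

```r
# Hypothetical helper: fill in a cite-key format string.
# %a = first author's last name (lowercased), %y = year,
# %t = first word of the title.
make_cite_key <- function(author_last, year, title, format = "%a%y%t") {
  first_word <- strsplit(title, "[^A-Za-z]+")[[1]][1]
  key <- sub("%a", tolower(author_last), format, fixed = TRUE)
  key <- sub("%y", as.character(year), key, fixed = TRUE)
  sub("%t", first_word, key, fixed = TRUE)
}

title <- "Bias in the Flesh: Attack Ads in the 2008 Presidential Campaign"

# The modified format gives keys like the one used in the examples above.
new_key <- make_cite_key("Messing", 2009, title, format = "%a%y%t")
print(new_key)

# The original format puts underscores between the parts.
old_key <- make_cite_key("Messing", 2009, title, format = "%a_%t_%y")
print(old_key)
```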
Then restart Zotero.
After you restart Zotero, set "BibTex CiteKey-Only Exporter" as the
Default Output Format in the Export preferences pane. Now you can
select a reference from Zotero and drag it off the screen into a
waiting text editor (e.g., Eclipse). Alternatively you can use
Cmd+Shift+C to copy the \citep{key} to your clipboard.
When you are ready to export your entire bibliography, right-click or
control-click on the folder and select "Export Selection." I recommend
downloading this .bib database file to the working directory for your
Sweave/LaTeX project.