Filing Cabinet

Introduction

I pick up so much information and so many handy tips from Twitter that my biggest problem is collating it all. So far I have tended to bookmark pages in a filing system on my bookmarks toolbar in Google Chrome. This kind of works, but I thought it would be nice to have more of a document that I could add notes to, as an aide-mémoire for future searching.

Inspired by this page by Katie Jolly, I have decided to create my own list of resources in a few themes:-

  • R
  • PhD resources
  • Other good resources

R

There are literally millions of online resources. Here are my favourites…

Books/Journals

  • R for Data Science by Garrett Grolemund and Hadley Wickham. The go-to book for so many of the most important data analysis techniques.
  • Data Visualisation: a practical introduction by Kieran Healy. Initially entitled “Data visualisation for the social sciences”, this book hasn’t been published yet (Jan 2018) but the development version is jam-packed with amazing stuff in an easy-to-follow format. Essential reading.
  • Practical Data Science for Stats is a PeerJ Preprints collection of non-peer-reviewed articles on nuts-and-bolts, how-to-wrangle-data stuff. It is curated by Jenny Bryan and Hadley Wickham. Nuff said!
  • Advanced R by Hadley Wickham. Again, an essential read if you are getting to the stage where you need to know/change what is happening under the bonnet.

Learning R

  • DataCamp. Some free courses; otherwise about £20 a month. WORTH. EVERY. PENNY. Really easy to follow, short video explanations with a handy practical interface. Probably the best learning resource out there. Also has loads of resources for Python, SQL etc.
  • learnR. A GitHub repo containing course notes for “R programming for behavioural scientists”, taught at the University of Texas at Austin. Nice markdown arrangement of materials.
  • Rexams. A package for teaching R at scale, with the ability to produce randomised exam questions, pdfs, etc. etc. An awesome resource if you are looking to teach R at university level (a future goal for me).

Explore

The most important tools/packages for exploring data are covered in the book R for Data Science under the chapter of the same name. This section shows a few additional tools that can come in handy.

  • The skimr package is my favourite for descriptive analysis. With a single command, skim(data), you get a great first look at your dataset, with counts, missing data, and even an inline histogram where a variable is continuous. On top of that, assigning skim(data) to an object gives you a table of all values and calculations in a tidy format for further exploration (see the first sketch after this list). The best of the many summary functions available, IMO.
  • I have come back to this Stack Overflow question and answer a few times. Basically a how-to on integer and modulo division for adding years and months using lubridate (second sketch below).
  • pointblank is a one-stop shop for pipeable validation functions. Handy for checking that variables in a dataframe behave like they are supposed to.
  • Missing data is always a big deal - there are two blogs that I always find helpful when trying to assess this. This one by Nicholas Tierney shows the various options available to visualise missing data in a dataframe. This one, I think by Rose Hartman, shows some of the options for multiple imputation. Nicholas has also authored the naniar package, which offers some cool (and tidy) ways of imputing missing values, described in this vignette. Another great missing data package I haven’t had a chance to try out yet is hmi. Built on top of the mice and MCMCglmm packages, it has a great explanatory vignette.
  • This handy function written by Daniel Falster helps assign values using a lookup table, as explained in this post.
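
As a flavour of the skimr workflow, here is a minimal sketch using the built-in iris data as a stand-in for your own dataset:

```r
# First look at a dataset with skimr; iris stands in for your own data
library(skimr)

skim(iris)            # counts, missing data, and inline histograms

skimmed <- skim(iris) # assigning the result keeps it as a table...
head(as.data.frame(skimmed))  # ...which you can wrangle like any data frame
```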
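
And a minimal sketch of the month/year arithmetic, using lubridate’s %m+% operator (which rolls impossible dates back to the end of the month):

```r
# Adding months/years with lubridate, no manual integer/modulo arithmetic
library(lubridate)

d <- ymd("2018-01-31")
d + months(1)     # NA: plain addition fails because "2018-02-31" doesn't exist
d %m+% months(1)  # "2018-02-28": %m+% rolls back to the month end
d %m+% years(2)   # "2020-01-31"
```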

Visualise

I tend to use ggplot2 and its extensions for visualisation. There are a number of other options but let’s face it - this is what R is famous for. Again, the basics are all described in R for Data Science & Kieran Healy’s Data Visualisation, so just a few extra pointers here.

  • GGally is an awesome extension to ggplot2 with a number of handy plots. The most useful for me is ggpairs, which allows you to quickly assess correlations between multiple variables.
  • I’m not sure if it is a dataviz faux pas or not, but I often like to show a frequency plot with percentages drawn on as labels above the bars. This blog shows some options (and there is a minimal sketch after this list).
  • In this blog, Simon Jackson shows a really, really cool technique for plotting background data for grouped counts/histograms. An excellent, effective way to understand data.
  • I’ve also used this blog by Max Woolf a few times. This is for when you have grasped the concept of ggplot2 well and are looking to tidy your plots up to the quality needed for publication.
  • There are numerous ways to alter the theme of your ggplot. My favourite is to load the cowplot package at the start of any script/markdown so that all subsequent plots use the beautifully clean cowplot theme. I tend to define my own colour palette with colours taken from the logo of my research base - UBDC.
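
Here is a minimal sketch of the percentage-labels idea, using ggplot2’s built-in mpg data:

```r
# Frequency plot with percentages drawn as labels above the bars
library(ggplot2)
library(dplyr)

mpg %>%
  count(class) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = class, y = n)) +
  geom_col() +
  geom_text(aes(label = paste0(round(pct * 100), "%")), vjust = -0.5)
```

Adding cowplot::theme_cowplot() to the end of that pipeline (or just loading cowplot at the top of the script) gives the clean theme mentioned above, and GGally::ggpairs(your_data) is all that is needed for the pairwise correlation matrix.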

Model

Once again there is a whole section of R for Data Science that deals with an introduction to working with models in R. Kieran Healy’s Data Visualisation also has a great chapter on models, specifically exploiting David Robinson’s broom package.

This section points to a few other resources that I have come across/heavily relied on for different projects.

Latent Class Analysis

  • Niels Ohlsen wrote a really nice blog with a step-by-step guide to using the poLCA package.
  • There are actually tidier ways of using poLCA. I filed an issue on the broom GitHub page a couple of years back requesting that poLCA be supported by broom. It took David Robinson 4 days to sort it out. I was amazed that you could get that level of support on open-source software! Anyhow - the poLCA_tidiers in broom are really awesome, and the documentation (?poLCA_tidiers) has great tidyverse-friendly code examples (a minimal sketch follows this list).
  • This blog by Rose Maier also has a nice walk-through, with great code for visualisations using both the poLCA and mclust packages.
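
Here is a minimal sketch of the broom tidiers in action, using the carcinoma example data that ships with poLCA:

```r
# Fit a 3-class latent class model and tidy it with broom
library(poLCA)
library(broom)

data(carcinoma)
f <- cbind(A, B, C, D, E, F, G) ~ 1   # the standard poLCA example formula
lca3 <- poLCA(f, carcinoma, nclass = 3, verbose = FALSE)

tidy(lca3)     # per-class outcome probabilities as a tidy data frame
glance(lca3)   # one-row model summary (log-likelihood, AIC, BIC, ...)
augment(lca3)  # original data plus predicted class memberships
```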

Survival Analysis

Time series

  • Although it is still developing, I really like the tibbletime package and have found a few of its functions really helpful (a minimal sketch below).
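
A minimal sketch of the functions I have found most useful so far, using the FB stock data that ships with the package (the API is still settling down, so function names may have shifted):

```r
# Time-aware filtering and rolling functions with tibbletime
library(tibbletime)
library(dplyr)

data(FB)
fb <- as_tbl_time(FB, index = date)   # declare the time index

fb %>% filter_time("2014" ~ "2015")   # concise time-based filtering

roll_mean_5 <- rollify(mean, window = 5)  # turn any function into a rolling one
fb %>% mutate(close_5day = roll_mean_5(close))
```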

Communicate

One of the greatest things about R is the versatility of RMarkdown to help communicate everything you do. Again (broken record here), R for Data Science has a section on the basics. There is also a DataCamp course.

I’m just going to describe a little of how I use RMarkdown in RStudio (not including analysis - I tend to use html RMarkdown outputs for that - but that’s another story).

Thesis

I am writing my PhD thesis using a mixture of RMarkdown and LaTeX. Unfortunately my thesis is in a private GitHub repo for now, as my university retains copyright for a year (I think). As soon as that is up I will be opening up the repo.

In essence I used this guide by Rosanna van Hespen and then started adding lots more LaTeX commands and packages as I developed the thesis.

It is worth noting that it is possible to write a thesis using the bookdown package, as this blog describes. I personally found the bookdown default options quite rigid and so flipped back to Rosanna’s set-up. There is much more versatility (in terms of tables, figures, adding pdf appendices etc. etc.) with the trade-off of spending a bit more time on technical research (googling LaTeX commands and packages). My version has a much longer YAML than the example in Rosanna’s repo, with loads more LaTeX packages utilised.

The best output for the thesis is in pdf format, which obvs requires a LaTeX installation. It is worth looking at the tinytex package. This provides TinyTeX, a minimal LaTeX installation, saving a huge amount of disk space compared with e.g. a full MiKTeX installation. I have found it easy to install and to add the required packages (a minimal sketch below). There have been some difficulties in installing more obscure LaTeX packages, but TinyTeX itself is still in development.
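
Getting up and running is only a couple of commands in R (a minimal sketch; the package name passed to tlmgr_install is just an example):

```r
# Install the tinytex R package, then the TinyTeX distribution itself
install.packages("tinytex")
tinytex::install_tinytex()

# Extra LaTeX packages can be added individually as you need them
tinytex::tlmgr_install("booktabs")
```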

If, like me, you have a supervision team that only use the “Track changes” functionality in Microsoft Word for feedback etc., then this free pdf-to-docx converter is a lifesaver. I tend to print the chapter that I am looking for feedback on to pdf, upload it to this site, then download the converted docx to send to the team. Works a treat.

Website

This website is rendered using the blogdown package and the hugo-academic theme (a minimal set-up sketch below). Changes are pushed to a GitHub repo, and the website is then served via Netlify. Every time I push changes to GitHub the website automatically updates like magic.
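
For anyone starting from scratch, the initial set-up is only a few lines (gcushen/hugo-academic is the theme’s GitHub path):

```r
# Scaffold a new blogdown site with the hugo-academic theme
install.packages("blogdown")
blogdown::new_site(theme = "gcushen/hugo-academic")

# Live local preview; the site rebuilds automatically as you edit
blogdown::serve_site()
```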

The whole process is documented in this book. Be warned! This can be a frustrating process to get the hang of but once you have cracked it things are fairly easy to maintain and add to.

Database connections

Some of my analyses deal with pretty large data files (up to 135 million rows of data). So far I have managed to deal with everything in-memory (I have 64GB of RAM available for my PhD projects). However, I am acutely aware that I am only one memory-heavy wrangling/modelling command away from crashing R completely. For that reason I have kept an eye on some of the developments being made in connecting R to databases via RStudio. A wee list of resources here:-

  • Fortunately RStudio have made a pretty comprehensive guide to working with databases (and there is a minimal sketch after this list). (Unfortunately I can’t use this for my analysis because I have to use an older version of RStudio Server.)
  • There are loads of other options nicely debated in this RStudio Community thread on the subject. Lots of these involve cloud-type virtual machines, which (again) aren’t an option for me.
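
As a flavour of the dplyr-plus-database workflow described in that guide, here is a minimal sketch using a local SQLite file (the file and table names are hypothetical):

```r
# Query a database lazily so only the summarised result comes into RAM
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "care_data.sqlite")  # hypothetical file

tbl(con, "episodes") %>%   # a lazy table reference; nothing loaded yet
  count(year) %>%          # translated to SQL and run inside the database
  collect()                # pull just the small summary into R

dbDisconnect(con)
```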

Program

  • purrr is a package that replaces the apply family of functions and things like for loops. It is part of the tidyverse and works to that philosophy. I’ll be honest and say I still haven’t fully grasped the whole concept of using purrr functions. Generally, when I come across an iteration problem, I try to google my way through it! What I should really be doing is spending a little time working through this tutorial by Charlotte Wickham, or the DataCamp course that provides an intro (a small sketch follows this list).
  • Writing functions using dplyr is hard! There is no getting away from this. The reason is the need to use tidyeval when referring to variables etc. Again, I am nowhere near competent in this; however, I have managed to write a couple of functions with tidyeval that actually work (second sketch below)! This blog and this tutorial are both helpful, as is the programming with dplyr tutorial. Still - expect lots of head scratching! Another thing to add to the list of get-your-head-around. I can see the benefits of mastering it though!
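
A minimal sketch of the sort of iteration purrr handles well - fit the same model to each subgroup, then pull one number out of each fit:

```r
# Split-apply-combine with purrr instead of a for loop
library(dplyr)
library(purrr)

mtcars %>%
  split(.$cyl) %>%                    # one data frame per cylinder group
  map(~ lm(mpg ~ wt, data = .x)) %>%  # fit a model to each
  map_dbl(~ summary(.x)$r.squared)    # extract one number per model
```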
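
And a minimal sketch of the kind of tidyeval function that actually works, following the enquo()/!! pattern from the programming-with-dplyr tutorial (the function name is my own):

```r
# A dplyr function that takes a bare column name, via tidyeval
library(dplyr)

count_by <- function(df, group_var) {
  group_var <- enquo(group_var)   # capture the bare column name
  df %>%
    group_by(!!group_var) %>%     # unquote it inside the dplyr verb
    summarise(n = n())
}

count_by(mtcars, cyl)  # counts of cars by number of cylinders
```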

Conference videos

The cool thing about software developers is that they are well up on tech! That means when they have a conference they know the drill with recording talks and getting them onto the internet easily! This link is to the talks given at the useR 2017 conference. A few real gems in there!

PhD Resources

Enough of R already!! My PhD aims to link together administrative health data from Scotland’s NHS and social care data collected by the Scottish Government (SG). This has been a great (and long) learning process!! There are a few links here that I always seem to return to.

Social care resources

  • If you want to know why social care (and social care funding) are important, this 20 min talk by Sir Andrew Dilnot is well worth your time.
  • Scottish Care is a body that represents the independent care sector in Scotland. If you are looking for background information on the social care sector in general then they produce some fantastic reports. I highly recommend “Bringing home care: A vision for reforming home care in Scotland” (2017) and “Home delivery: A profile of the care at home sector in Scotland” (2015).

Data

  • The social care survey is collected annually and the SG produces a report with aggregate stats. The aggregate stats can be used (with difficulty) for subgroup analyses. At the bottom of the linked page is a further link to open data releases of the 2010, 2011, and 2012 social care surveys. This is anonymised but has individual-level information. Unfortunately a lot of the data has been banded into categorical groups, which makes it a little harder to really drill down into what is going on (the banding was important to protect anonymity). Having said that, there is still some interesting stuff to be gleaned, as I hope this post shows.
  • ISD at NHS NSS are the go-to people for Scottish health data. The national data catalogue is a great resource. You can search the national datasets or data dictionary for what you are after. If you just want to browse stats then the new website harnesses R to give some great online visualisations and downloadable links.
  • There is a portal at this link to all the open data published by the Scottish Government over a number of areas (education, crime, housing etc.). Another great resource.
  • Finally, a cool look at how SIMD is calculated is presented here with links to the associated Github repo and R code.

Other good resources

  • If you want to learn how to use GitHub effectively, Jenny Bryan is in the process of completing an online book.
  • Any sort of software problem you are having is the same problem somebody else has had before. Stack Overflow is the place to find out how to fix it - especially R stuff. A few questions that I have starred, as I keep coming back to them, are here, here, here, and here.
