INTERVIEW

Six Questions for Jonathan Carroll, author of Beyond Spreadsheets with R

By Frances Lefkowitz

Jonathan Carroll is a data science consultant providing R programming services. He holds a PhD in theoretical physics.

__________________________________________________________________

Save 39% off Beyond Spreadsheets with R. Just enter intcarroll into the discount code box at checkout at manning.com.
__________________________________________________________________

Absolutely! In fact, I bet you’re not giving yourself enough credit about the spreadsheet. If you’ve ever added two cells in a spreadsheet, then you’ve already been programming. You might have even typed the formula yourself, e.g. =A1+B1. Beyond Spreadsheets with R takes that as its starting point and guides you toward a programming approach to working with data. Once you understand the basic structures, you’ll be able to calculate the average of a set of values, plot them, and discover patterns within them in a reproducible way.
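As a rough illustration of that step from formula to code (a minimal sketch, not taken from the book), the spreadsheet formula =A1+B1 and a column average might look like this in R. The variable names and values are hypothetical stand-ins for spreadsheet cells.

```r
# Hypothetical values standing in for spreadsheet cells A1 and B1
a1 <- 3
b1 <- 4

# Equivalent of the spreadsheet formula =A1+B1
total <- a1 + b1

# A column of values, stored as a vector (made-up numbers)
values <- c(12, 15, 9, 20)

# Equivalent of averaging that column in a spreadsheet
average <- mean(values)

print(total)
print(average)
```

Because the steps are written down as code, rerunning them on new values only requires changing the inputs.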

If you’ve been working in a spreadsheet then you’ll be able to formalize all of that. This means you can construct a processing script which performs operations on data, and you can apply that script to different versions of your data without having to redo everything every time. This means less “click this button then select these values,” and more reliable/reproducible processing. Once you know R at the “end of this book” level, the possibilities become endless. I’ve used spreadsheets as a starting point for data, but that’s hardly the only source of input R can work with. With the same tools, you can collect and analyze Tweets, manipulate images, or connect to other systems to do processing. You can even work with R to do things that aren’t “data,” such as build a blog, generate art, or even connect to another programming language.
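To make the idea of a reusable processing script concrete, here is a minimal sketch in base R. The function name `summarise_sales`, the column names, and the data are all invented for illustration; the point is only that the same steps can be applied unchanged to a new version of the data.

```r
# Hypothetical processing step: compute a total per row,
# then sum those totals for each region
summarise_sales <- function(data) {
  data$total <- data$price * data$quantity
  aggregate(total ~ region, data = data, FUN = sum)
}

# Two made-up "versions" of the same kind of data
january <- data.frame(
  region   = c("north", "south", "north"),
  price    = c(2, 3, 2),
  quantity = c(10, 5, 4)
)
february <- data.frame(
  region   = c("north", "south"),
  price    = c(2, 3),
  quantity = c(6, 8)
)

# The same script runs on both without any manual clicking
jan_summary <- summarise_sales(january)
feb_summary <- summarise_sales(february)

print(jan_summary)
print(feb_summary)
```

In practice the data frames would more likely be read from files, for example with `read.csv()`, rather than typed in by hand.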

I’ve often been employed to work with R to transform data reliably from one source to another. Real-world data is usually messy (missing values, misspellings, odd formats), and while a spreadsheet is a nice way to interact with that data, it’s not productive if you need to do those operations many (maybe thousands of) times. I’ve built models to estimate statistical quantities based on other people’s data, and built tools to help myself and other people inspect the raw and generated data as tables, graphs, or reports. I’ve worked with data on fisheries, the electricity market, sports betting odds, and teaching surveys. At the moment I’m working with genomics data to help scientists look for a way to treat cancer. I wrote my book in R Markdown, which means you can trust that all of the code output actually derives from the input. I’m also writing my new blog in R Markdown and publishing it right from RStudio.

The basics for me are “how do I get this data into my analysis?” and “how do I clean up this data?” These are absolutely covered. As for what sort of analysis you might want to do to that data, I have a small preview in the book, but it’s so dependent on the end goal that it’s worth following up later with a more specific resource. The old saying goes that “data cleaning is 90% of data science; the other 90% is doing something with the data,” which hints at the fact that cleaning the data often takes a lot more effort than first assumed.

If you’re trying to decide if a programming solution is right for your task, consider how many times you’re going to need to use your workflow. If it’s just once, then you can maybe get away with your spreadsheet. If you need to do it twice — and that may just be that you need to revisit the one time you did it to confirm it’s correct — then a programming solution is going to be of benefit. Plus you’ll be able to share what you’ve done, with a colleague, a client, a supervisor, or yourself in six months.

I’ve been brewing for about fifteen years now, from grain for about half that time. It’s a wonderfully complex process if you really get into the science of it all (water chemistry, thermodynamics, yeast biology …) and the end result is deliciously rewarding. South Australia gets quite hot, so I lean towards ales unless I really want to overwork my lagering fridge. I’m not a big fan of fruity beers, though I certainly like fruity hops. But I once made a Vegemite stout, so who am I to talk?

Originally published at freecontent.manning.com.
