Bob Muenchen is the author of R for SAS and SPSS Users, and co-author of R for Stata Users and An Introduction to Biomedical Data Science. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in data science software, reviewing such software, and helping people learn the R language.
Tell me about your company, r4stats.com. What is your role? What is your day-to-day like? What are some of your favorite aspects of your job?
The company is a one-man operation, which gives me total flexibility. In the early years, I focused on providing statistical consulting services. That was a lot of fun, but the commercial software I used at the time – mostly SAS and SPSS – was very expensive, so I limited my potential clients to those who had their own licenses.
When R came out with its free and open-source license, I started using it immediately. It gave me the freedom to work with any clients. However, its extreme flexibility makes it a bit hard to learn, and it took me longer than I expected to get used to it. I kept detailed notes on the differences between R and SAS/SPSS and shared them online. They were soon getting thousands of hits per month, and my email filled with migration questions: companies wanted to know the best way to migrate to R. I ended up writing the book R for SAS and SPSS Users as a result. When that came out, my clients started asking for R training more than statistical consulting. Stata users had similar conversion questions, and soon I wrote a similar book, R for Stata Users, with Stata guru Joe Hilbe. Some of my suggestions on Helping Your Organization to Migrate to R are summarized on my website.
This whole time I had a “regular” job too, managing the Statistical Consulting Center at The University of Tennessee. These two jobs went together well; anything I learned in one would usually help me in the other. UT has a generous vacation policy, which gave me time to do outside work. I also worked with non-UT clients a lot on weekends.
At the end of 2019, I retired early from UT to focus on my writing and to help with some open source projects. I worked with Dr. Bob Hoyt and several co-authors on a new book, Introduction to Biomedical Data Science. That is one of the first textbooks that focuses on the application of data science to medicine and healthcare. Frustrated by the slow rate of change allowed by traditional publishers, we published that book through Lulu.com. We hope to come out with a new version of that every year to keep up with the rapid rate of change in data science and how it is applied medically.
Throughout my career, I’ve worked with two very different types of analysts: data science types who were very comfortable with programming, and other scientists who did not have the time it takes to become good programmers. For a while this meant that I was using R on projects with the former group, and SPSS or JMP with the latter. That meant I was still dealing with the expense of commercial software. Then around 2012, several point-and-click, menu-based front-ends to R started to appear. I hoped that one of these would allow me to use the same tool with both types of clients. I now track the progress of eight different front-ends to R. I have written extensive reviews of each on my website, which I update with each new software release.
How did you first get into Data Science/Statistics?
I got my undergraduate degree in psychology from Bradley University. That involved a lot of statistics and research methods classes, which I found interesting. After graduation, I went to Arizona State University to work on a PhD in Educational Psychology. There I took several more statistics classes and had an assistantship doing statistical programming in SPSS and BMDP. I also took a class on Industrial/Organizational Psychology, which emphasized the role of research in designing user interfaces. I got extremely interested in that and left after a year to pursue a PhD in I/O Psych at UT. By then I had such a strong statistics background that they sent me off for an assistantship in the Statistics Department. After a year, it became clear that if I majored in statistics, I could use it to study a broad range of areas, so I changed majors yet again. After four years of grad school, I ended up with a Master’s degree, since the program did not offer a PhD at the time.
Upon graduation, I worked as a stat consultant on a half-time assistantship in the UT Computing Center’s Statistics Group. That team changed names several times over the years, but it always focused on providing free statistical consulting to all departments at UT.
What data science tools (R, Python, Hive, SQL, SAS, Hadoop, Etc) do you use most? Which tools do you recommend? Which are outdated?
This question hits upon one of my personal obsessions! Shouldn’t data scientists always be wondering what’s gaining in popularity and what’s declining? I’ve seen many computing paradigms come and go in my career – mainframes, mini-computers, pre-Linux Unix servers, and expensive Unix workstations. I’ve seen people become so obsessed with a particular tool that they hang onto it too long, resisting change and damaging their careers. So I track The Popularity of Data Science Software on a web page that I update regularly. There I gather my own data on what employers are looking for and what is being used in scholarly articles. I also summarize market share data from a variety of other sources. It’s one of the most often read pages on my website.
I use R the most, through RStudio, BlueSky Statistics, and jamovi. I also use a bit of Python and KNIME. I’m a fan of SQL, though I rarely use it directly since its syntax is so different from R’s. The dplyr package in R, from Hadley Wickham & friends, is essentially SQL recast in R syntax. I use dplyr and the rest of the tidyverse functions many times per day.
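To make that SQL-to-dplyr correspondence concrete, here is a minimal sketch using R’s built-in mtcars data (the particular columns are chosen purely for illustration):

```r
library(dplyr)

# SQL equivalent:
#   SELECT cyl, AVG(mpg) AS mean_mpg
#   FROM mtcars
#   WHERE am = 1
#   GROUP BY cyl;
mtcars %>%
  filter(am == 1) %>%                # WHERE
  group_by(cyl) %>%                  # GROUP BY
  summarize(mean_mpg = mean(mpg))    # SELECT ... AVG(...)
```

Each dplyr verb maps onto a SQL clause, which is why the related dbplyr package can translate pipelines like this into actual SQL and run them inside a database.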
A data science question that greatly interests me is: can you do data science in a graphical user interface rather than code? No one questions that code offers the most power and flexibility. But what about menu-based systems like SPSS or JMP, and workflow-based systems like KNIME, RapidMiner, or SAS Enterprise Miner? Are scientists who analyze data using those tools “doing data science?” That depends on how tightly you define data science. They’re certainly doing statistical analysis, and doing it in a great many scholarly articles, as my data show.
For a couple of years, I thought that workflow-based tools would replace the menu-driven style of interface. However, I no longer think so. They are just complex enough to prevent that from happening. In a menu-driven interface, there is very little to learn up front. You run each task one after another, that sequence determines the order, and reports appear in that order; simple! After the fact, the order can often be shuffled, or analyses deleted, using a table-of-contents feature. Workflow-based interfaces, on the other hand, require you to learn how to connect the various nodes in the flow (each node has a dialog behind it, just like a menu-based one). Those nodes allow work to flow in two dimensions, making it inherently more complex. Plus, as you connect each node in the workflow, data may flow, or a model may flow, adding further complexity. Finally, with a two-dimensional flow of work, what determines the final report? There are report nodes, but they too add complexity.
My current thinking is that data science tools have three levels of difficulty that are likely to be with us for quite a while: menu-based for simplicity, workflow-based for more power, and code-based for maximum power at the cost of ease of use.
I see you are a PStat (ASA Accredited Professional Statistician). Can you tell me about the process of becoming accredited? Would you recommend that data science students pursue PStat accreditation?
Applying for PStat certification consists of compiling lists of relevant coursework, sending examples of publications you’ve written, and gathering letters of recommendation from statisticians who know your work well. It takes around five years to compile enough of a track record to meet the criteria.
Data science is such a broad field that the value of getting the PStat certification depends on your training and your goals. For a data scientist with a PhD in statistics, I don’t think it would be very important. But for any other data scientist who is going to be applying statistical methods that emphasize probability and confidence intervals, I strongly recommend it. Over the years I have interviewed data scientists who could describe the details of, e.g., the GBM algorithm, but who had trouble explaining statistical basics such as random vs. fixed effects, or what a two-way interaction is. I’ve never met a PStat who did not know statistical concepts thoroughly, regardless of their university major. Every person I hired when I managed UT’s Statistical Consulting Center started out with a list of goals to aim for, and getting PStat certified was always one of them.
From your vantage point, given your career and your involvement in the field for a while now, do you think there is a shortage of knowledgeable Data Science professionals and why or why not?
The demand for data science jobs has grown over the years to a level that amazes me. When I graduated in ’82, the only places I could get a job in the Knoxville area were UT, TVA, and Oak Ridge National Labs. Now there are jobs throughout the area. Companies finally realized that those who continued making decisions by instinct instead of data science were doomed to lose in the long run.
One of my projects for a large company entailed developing a model to predict the revenue from various potential new store locations. The staff member who gave the presentation to the board of directors was only about fifteen minutes into it when all hell broke loose. Unknown to us, the person whose job it was to use his “keen instinct” to choose locations had fudged the data to make our model look bad. Fortunately, there were board members present who knew that the variables we said were most important were not, and they caught on immediately to what had been done. The company kept using its instinct-based method, and it folded a couple of years later.
A colleague of mine did forecasting for another major corporation. They had been using vague “guestimates” of how much of each product to stock. That meant that they ran out of popular products, and had to sell the unpopular ones they had overstocked at a loss. When he was hired, he introduced time series models, and he tracked how much money that saved, which was very substantial. Later, when times got tough, they laid off almost the entire forecasting group, saying they couldn’t afford them, even though the amount they saved was far greater than the cost of their salaries! That company is gone now too.
Jobs might be tight until this coronavirus outbreak passes and the economy recovers, but the future of data science jobs is very bright indeed.
How do you explain your work to people outside of the field, or to people that don’t have a background in Data Science?
I tell them that I help people make good decisions. I use the example of a new drug. The company hopes it’s better than the old drug, but is it? If the new drug instantly cured everyone with no side effects at a reasonable cost, the decision would be easy. But we all know that some people will do better, and some people might do worse. From all the pharmaceutical TV ads, we’re all too familiar with gruesome side effects. Then how do you decide? I use data science tools to do the calculations and provide probabilities about how the overall population will react to the drug. Most people get that right away and start asking questions about decisions they face, which leads to some fun conversations.
What do you think is the most exciting thing happening in the data science research realm?
I expect this answer will come as a surprise to your readers. There are an incredible number of interesting things happening in data science, but the one that excites me the most is GUI-based reproducibility and reusability. Reproducibility has been a hot topic in recent years for a good reason. If we’re going to make claims about the nature of some aspect of science, we should be able to reproduce every step of it. Until recently, only code-based solutions – especially those documented with Markdown – could provide reproducibility.
I’ve had many grad students ask me why their menu-based stat software gave them two different answers for the “same” analysis. With no precise record of what was done, there’s no way to know. Even if they bothered to save the code generated by systems like SPSS, it’s code they don’t understand since they did not write it. However, in the past year, menu-based systems that record every step and can replay it (reproducibility) or modify it, perhaps for use with new data (reusability), have been released by jamovi, JASP, and SPSS. BlueSky Statistics is also working on a version that includes it. This feature doesn’t get the attention that cool new algorithms get, but I think it will improve the work of a very large user base.
I’m also keenly following the progress of the Julia language, which aims to be both easy to use (for a programming language) and very fast. Much of R is actually written in C or C++ behind the scenes because those languages are so much faster than R itself. So if you want to see how R is doing something, perhaps to change it, you’re diving into a different language altogether. It would be nice to have a single language that is both fast and easy. Julia is making good progress, but it’ll take quite a while to catch up with the competition.
Looking ahead into the next five or ten years of data science, how do you think things will change? What do you see as the big opportunities, especially framing it from the perspective of students who are just beginning their careers and thinking about their educational options?
One of the biggest long-term trends will be increased automation. I’m particularly fond of the way Max Kuhn’s tidymodels package helps automate machine learning. Doing data preparation with functions like step_YeoJohnson(all_numeric()), which finds the optimal transformation for a potentially large number of variables without my having to name them all (names that change from dataset to dataset), really speeds up my work. I think tidymodels is a good tool for learning the various ways modeling can be automated. I expect many tools will become far more automated, such as BlueSky Statistics’ single dialog for cross-validated model tuning, but data science students would do better to see the steps of that automation without having to get down to the low-level C++ coding. Tidymodels offers a nice view into that world.
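As a rough sketch of the idea, here is how the recipes package (the data-preparation part of tidymodels) can apply a Yeo-Johnson transformation to every numeric variable at once; the built-in mtcars data and the mpg outcome are used purely for illustration:

```r
library(recipes)

# Declare the preprocessing step; all_numeric() selects every
# numeric variable without naming any column explicitly.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_YeoJohnson(all_numeric())

# prep() estimates the optimal transformation parameter for each
# selected variable from the data; bake() then applies those
# transformations (here, to the training data itself).
transformed <- bake(prep(rec), new_data = NULL)
```

The same prepped recipe can later be baked against new data, so the transformation parameters learned on the training set are reused rather than re-estimated.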
In the context of us talking about data science, our audience is students or early career professionals, trying to enter the field. What’s the best piece of career advice you’ve gotten or that you give?
I recommend having the perspective that every tool will become obsolete and be replaced with something better. Make sure you see that coming so that you’re not replaced along with the tool that you just can’t let go of!
In addition, the choice of data science tools is too often like the choice of religion. All the data science tools have their strengths and weaknesses, and criticizing the tools that meet the needs of others is rarely beneficial to your career. At UT, our clients doing mixed-effects models are often surprised to see R choke on a complex model, then see us switch to SAS to get a solution. Just because something is old doesn’t mean it’s not good!