Fascinated by calculus as a teenager, Josh Wills went to Duke University to study theoretical mathematics. In his final year of college he was introduced to statistics, found he preferred it to partial differential equations, and has loved it ever since.


After graduating, Josh worked at IBM while pursuing graduate study in operations research at the University of Texas at Austin, where he focused on NP-hard problems. He then moved into the startup world, working as a statistician at Zilliant and then Indeed, before eventually landing at Google.


                                                        

Josh Wills

Director of data science at Cloudera

In this interview, Josh talks about what fascinates him about data science, the humility and appetite for learning the field demands, his work on open source projects, and the profound influence of Google’s engineering culture. Josh Wills is now Cloudera’s senior director of data science. His job there, as he puts it, is to “make the data great.”

                                                       

1. To start, we’d like to hear about your undergraduate experience. After your undergraduate degree, you went on to graduate school. How did that experience lead you to where you are today?

I majored in mathematics as an undergraduate. The funny thing is that although I was good at math from childhood through high school, I never really liked it. I was more interested in history and political science, until I took calculus in high school. I fell in love with it; calculus was the first genuinely interesting thing I had encountered in school.

I was a straight-A student in high school, and no one else in my class could keep up with me. I found I could teach myself enough to pass the exams for college credit with hardly any instruction. In my first years of high school, I took the exams in political science and comparative government without ever attending a single class, and I did very well.

So in my later years of high school I did the same thing, passing exams in art history, economics, and physics. Then I worked through the Calculus AB and BC material in a single break, then multivariable calculus, and finally linear algebra, all on my own. By that time I was completely captivated by the beauty of mathematics; it was like standing in front of a beautiful painting.

I ended up at Duke. One of the great things about Duke was that I could take all the math classes I wanted. My first course was a graduate-level topology class. It was a very interesting class, and there were a lot of very good mathematicians in it with me.

Obviously, that course was far beyond me. However good I might have been, everyone else was a graduate student, and the experience humbled me. I think everyone has a moment of self-doubt sooner or later, and I was lucky that mine came early, in my first year of college, so I had plenty of time afterward to slowly rebuild my confidence.

Anyway, I was into math, and I always assumed I was going to be a math professor. But I was interested in a lot of other things too: I took philosophy and economics for a while, and then got into cognitive neuroscience.

I was lucky enough to spend the summer after my second year at Carnegie Mellon University in an undergraduate research program, working on modeling routes and spatial navigation. It was the first time I had used MATLAB, and I built a model to simulate how the brain does that kind of work. That experience is what got me into programming.

2. Did you start taking various programming classes at Duke right away?

Yes. I took a computer science course at Duke and learned to write C++ programs. But I never took courses like algorithms or operating systems, and later in my career, especially in interviews, that gap caused me real trouble and became an embarrassing hurdle.

At the beginning of my third year of college, I decided to hold off on the academic route and graduate school and instead look for a real job. I interviewed with a few startups and got offers, but all of that was crushed when the dot-com bubble burst in 2000 and early 2001.

That was a common story at the time, and Duke’s career services office worked hard to help our hapless cohort find jobs wherever it could. I ended up getting a job in IBM’s Austin office. My first day was June 17, 2001, and a week after I started, IBM announced a hiring freeze, so I felt like I had slipped in just under the wire.

IBM has a hardware team in Austin that focuses on chip design and system bring-up, which in short means drilling into very raw hardware components to catch every bug until the operating system can load and run.

I was in charge of a MySQL database used for microprocessor testing. The entire database was 15GB, which seemed huge at the time but looks woefully small now; my phone has more storage than that whole database! My job was to build dashboards and then do statistical analysis of machine and chip performance, in order to estimate how fast a chip could run based on measurements taken during manufacturing. So it was very traditional statistics, very traditional data analysis, with just a little bit of programming.
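That kind of analysis can be as simple as fitting a line from a manufacturing-time measurement to an achievable clock speed. Here is a minimal sketch of the idea; the measurement, the numbers, and the model are all invented for illustration and are not IBM’s actual process.

```python
# Toy illustration of "very traditional statistics": predict how fast a chip
# can be clocked from a measurement taken during manufacturing.
# All data and variable names here are made up for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: a wafer-test reading vs. the max stable clock (GHz)
test_reading = [1.2, 1.5, 1.8, 2.1, 2.4, 2.9]
max_clock = [2.10, 2.02, 1.95, 1.91, 1.83, 1.74]

a, b = fit_line(test_reading, max_clock)
print(f"predicted max clock for a reading of 2.0: {a + b * 2.0:.2f} GHz")
```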

In fact, the work was dull enough that I got bored pretty quickly. Looking back on it now, it seems like a miracle that I produced anything under those circumstances; it was a testament to my ability to do a decent job whether I liked the work or not.

Bored with that job, I applied to the operations research program at the University of Texas at Austin. The University of Texas had no statistics department, and statistics was what I wanted to study, so operations research was the closest thing I could get in Austin, which was a great city to be in at the time.

I had only taken statistics in my final year of undergrad, and that was a year when I had largely checked out of school. That year I took music appreciation, an introduction to logic (oddly enough, a philosophy course), and an introduction to statistics. Introduction to Statistics was actually a graduate course at the time, but it was easy for me given my background in linear algebra and partial differential equations.

And the funny thing is, I liked it right away. Many of the philosophy and neuroscience courses I had taken dealt with theories of cognition and symbolic reasoning, and the whole point of those courses was to try to understand how we can be sure that we know something.

3. Statistics is about quantifying uncertainty and what we don’t know.

Exactly! It is a discipline for quantifying what is knowable and what is unknown. Here is your data: what firm conclusions can you draw from it? I find that endlessly interesting; personally, I find this stuff irresistible. I love statistics. Now fast forward to my time at the University of Texas, where I took a full load of operations research courses.

During those two years, I took three courses a semester to finish my master’s degree while working at IBM. It was a terrible idea, a total disaster, and I had no life at all.

4. It sounds like you learned how to do simple statistical analysis at IBM and thought, “I’d like to learn more about that.”

Yes, that’s it! The software engineering demands of my IBM job were pretty light, and I wrote a lot of crazy Perl scripts that partially automated my work. But that job taught me the basics of statistics and made me realize that statistics is genuinely useful in the real world. I figured that if I wanted to learn more, school was the best place to do it, so I went back to school.

During my first year of graduate school, I made another switch: I moved to a different part of IBM so I could do some “real” programming instead of tinkering with dashboards and writing Perl scripts. The department I moved to did very low-level C++ firmware programming. The job basically involved writing firmware for hardware systems that didn’t yet work properly. Working as part of a team, I started learning things like version control and testing, which I had never learned in school.

The most important skill I learned was how to debug black-box systems. I had to get firmware running on a hardware system that wasn’t working yet, and it was my job to figure out how to fix whatever problems turned up and make it work. I didn’t know much about hardware at the time, and I still don’t; I couldn’t do even basic hardware programming today. I think I ended up becoming a software engineer precisely because I had to learn to understand systems I hadn’t designed myself.

In short, the black box was the hardware that didn’t work. When I gave it some input, it gave me no output. I had to poke at the system, finding some command or instruction that would get the hardware talking to the rest of the system again. That experience of debugging a black box was probably the most important skill I learned there.

5. What did you learn from your experience debugging these black boxes?

I didn’t find the black box frustrating; I was fascinated by it. I was the kind of kid who could spend five or six hours playing with Legos, and I still love Lego to this day. I was born in 1979, which makes me just about a millennial. To me, a computer system not doing what I tell it to do is simply not an acceptable state of affairs.

No matter how long it took, I was determined to fix the black box until it did what I told it to do.


I’ve run into several good problems like that over the years. A good problem is one that sits just above your level, no matter how good you are. Working on something a little more demanding than what you can comfortably do is a great feeling, and I tend to get completely absorbed in it. At times like that, though, my relationships tend to fall apart, because I can’t muster interest in anything outside of work.

There was a time when it was fashionable, when interviewing data scientists, to have candidates actually analyze data during the interview. I couldn’t agree more with that practice. I once had an interview where they gave me a question and a data set and asked me to sit down for two hours and do the analysis. It was probably the happiest two hours I had all year. I would go on more interviews just for that feeling.
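That kind of exercise, “here is a data set, what can you firmly conclude from it,” is the essence of quantifying what is knowable. A tiny, entirely invented illustration is a confidence interval, which reports the range the data supports rather than a single point.

```python
# What firm conclusions can you draw from a handful of observations?
# A 95% confidence interval is one classic way to quantify that.
# The data and context are invented for illustration.
import math
import statistics

clicks_per_session = [3, 5, 2, 4, 6, 3, 4, 5, 2, 4]

mean = statistics.mean(clicks_per_session)
sem = statistics.stdev(clicks_per_session) / math.sqrt(len(clicks_per_session))
z = statistics.NormalDist().inv_cdf(0.975)  # ~1.96, normal approximation

low, high = mean - z * sem, mean + z * sem
print(f"mean = {mean:.2f}, 95% CI approx ({low:.2f}, {high:.2f})")
# The interval is the honest answer: the data supports a range, not a point.
```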

6. But you have mentioned that there was a difficult period in your academic experience. One of the appeals of academia seems to be that once you reach a certain point, you get to spend a lot of time working at the cutting edge, and your personality seems well suited to that. Why did academia stop appealing to you?

As a pseudo-millennial, I may not technically be a millennial, but I’m every bit as impatient as they are. The academic world stopped appealing to me because of how much I would have to get through before reaching the point you’re describing, where I could spend a long time attacking one narrow question.

Once you’re in graduate school, you work for an advisor and do exactly what your advisor tells you to do, and there’s a lot of it. Then you spend years as a postdoc before you even have a chance at becoming an assistant professor. After all that pain, tenure is still another decade away.

That’s a long time before you get the chance to spend your days doing what you actually want to do. Even as a professor, I don’t think it’s worth it, because you still spend so much time applying for grants and managing students and postdocs.

Now I’m 35. In terms of timing, I’m probably at the peak of my career. I have a great job where I can do everything I want and love. But that also calls for careful thought. Being in a position where you can do whatever you want can be stressful, because if you screw up a project or miss an opportunity to make a big impact, there’s no one else to blame.

Amr Awadallah once wrote a post about what a CTO should do. He compared the responsibilities of a CTO to those of a CFO. The CFO isn’t responsible for each quarter’s sales numbers, but if there’s a major error or omission, the CFO can be fired. Likewise, the CTO isn’t responsible for shipping the product; that’s the job of the VP of engineering. But if the CTO misses some big technology wave, he can go.

7. Can you talk a little bit about the difference between IBM and Cloudera? How did you find the differences?

Our conversation skipped over graduate school, where I took a course on price optimization. One of the professors also worked for a local Austin startup called Zilliant. I wanted a job focused on operations research, so that professor hired me as a data analyst. There I learned SAS and R and began doing data analysis and modeling for problems like market segmentation and price elasticity.


When you come out of academia, you generally find that the real world is more interesting than it looks, and that the problems you have to solve are much harder than they seem. The reason price optimization isn’t as popular as, say, software engineering is that most Fortune 500 companies are simply struggling with the basic problem of making sure the price is higher than the cost.

If they can’t keep track of what their costs are, they can’t possibly know what price they need to charge to break even. It’s not rocket science, and you don’t need a data scientist for it. All you need is a good report.

8. Why don’t these companies know this important information?

It seems like essential information, but the truth is there’s a lot they don’t know. The problem is sales incentives. Salespeople exist to close contracts, because their income depends entirely on those contracts. They put together whatever they’re going to sell, package it up, and sell it. That process requires some materials and professional services, mostly documents and various contracts.

Those documents get read and refined, but nobody thinks about how much it will actually cost to deliver on the contracts once they’re signed. Such costs vary wildly, and people tend to estimate them optimistically. They don’t expect negotiations to be contentious and bumpy. They don’t account for mistakes in their reports. They don’t consider that a hurricane might hit somewhere along the way.

These aren’t trivial problems you can ignore, but they’re also not the kind of problems that map neatly onto the techniques you learned in graduate school. They’re a completely different kind of problem.

They really are simple problems, but simple doesn’t mean easy, just as losing weight is simple but not easy. Most companies’ problems in industry are simple, yet hard to solve.

9. So after Zilliant, did you make solving industry problems your goal?

I want to be useful. I like solving other people’s problems, and I genuinely want to help; I’m helpful by nature. I do have a passion for abstraction, and I love art and other odd aesthetic things, but I’d rather spend my days focused on people’s problems and on making their lives better. Aesthetics and theory aren’t as compelling to me, because they keep pulling me away from real problems.

10. Before joining Google, you worked at a number of startups. Were you solving different problems at those startups? What ultimately made you choose Google?

The Google job took me out of Austin for good. I was so reluctant to leave Austin that I could make a long list of the opportunities I turned down because of it. In 2005, I was offered an engineering analyst position at Google, which I turned down. In 2007, I was offered a job as a data scientist at Facebook, which I also turned down. I still try not to think about what would have happened if I had accepted that offer.


What finally got me to San Francisco was auction theory. I took some Ph.D. courses in game theory and mechanism design at the University of Texas, including auction theory. I really liked it: it’s beautiful mathematics, and it can be applied to a lot of social problems. I always wanted to see what auction theory looked like in the real world, but there was never an opportunity to actually work on auction design in Austin.

Luckily, I had stayed in touch with Diane Tang, who had tried to hire me into Google back in 2005 to work on its ad quality team, the team responsible for the advertising auctions.

She is Google’s first, and so far only, female Google Fellow, but at the time she was simply the friend who recruited me to Google to work on the ad auction full-time. She has been a mentor and friend, and the most important person in my career.

11. What is Google’s ad quality team like? Is it full of people who have studied auction theory and want to apply it to the real world?

The thing to understand about Google is that, at its core, it is a company of software engineers. Eric Veach, who has a Ph.D. in computer graphics and no background in machine learning, designed Google’s original machine learning system. Eric had a problem, went and read up on it, and came up with the whole solution himself.

I remember when I first got to Google, trying to understand how that whole machine learning system worked. It was a really clever and unusual solution to the world’s first truly large-scale machine learning problem.

The original algorithm was very clever. I’ve never seen it published in a journal, and I don’t think it ever will be. Of course, Google has long since moved past it to smarter machine learning systems.

Eric is also the one who designed Google’s original auction algorithm. He was a graphics person who had never studied auction theory, so he read up on second-price auctions and came up with a very simple generalization called GSP, the generalized second-price auction.
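In a generalized second-price auction, bidders are ranked and each winner pays roughly the smallest amount that would still have won them their slot, which is the bid just below theirs. The sketch below strips out everything a real ad auction adds (quality scores, reserve prices, per-click weighting) and is only meant to show the basic mechanism.

```python
# Minimal generalized second-price (GSP) auction: rank by bid, and each of the
# top-k winners pays the bid of the bidder ranked immediately below them.
# Real ad auctions also weight bids by quality scores; that is omitted here.

def gsp(bids, slots):
    """bids: dict of bidder -> bid. Returns a list of (bidder, price) per slot."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for i in range(min(slots, len(ranked))):
        bidder, _ = ranked[i]
        # Pay the next-highest bid; the last winner pays 0 if nobody is below them.
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((bidder, price))
    return results

print(gsp({"a": 4.00, "b": 2.50, "c": 1.00, "d": 0.75}, slots=2))
# [('a', 2.5), ('b', 1.0)] -- each winner pays the bid just below their own
```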

I dabbled in a lot of auction-related problems and ended up at Google. I really liked Google, but at the end of the day, the auction work we did there could only be as sophisticated as people’s understanding of auctions.

It’s great that there is so much enthusiasm from advertisers right now, but the reality is that the really interesting auction strategies and auction models are very complex and computationally expensive, and you would need a very strong engineering team just to see what’s possible. Google doesn’t want an auction model that complex, though, because nobody would like it except the people who work on auction theory.

12. That sounds like a significant difference between academia and industry. In academia you always aim for the optimal result, but in the real world the highest priority isn’t just the metrics; it’s also usability and user experience. Was that transition difficult for you at first?

It never felt like a problem to me; I was very lucky. Most of my graduate work in operations research was on intractable problems. Operations research is full of fundamental, hard questions that you can’t answer exactly. The goal of that kind of work is simply to do the best you can, and I actually liked it for that reason, because the pressure is different: if a problem can’t be solved exactly and you still get an answer, even one far from the optimal solution, that’s interesting.

13. There’s a joke: “If you have an NP-hard problem and you do anything even slightly clever, your solution is probably exponentially better than anyone else’s.”

I couldn’t agree more with that. It’s a forgiving line of work. Operations research is a relatively applied discipline within academia, so switching to industry wasn’t difficult for me.
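The joke has a real kernel: for many NP-hard problems, a simple greedy heuristic comes with a provable guarantee. Set cover is a standard example; always picking the set that covers the most uncovered elements gives roughly a ln(n)-factor approximation. A small illustrative sketch:

```python
# Set cover is NP-hard, but the greedy heuristic (always pick the set covering
# the most uncovered elements) achieves a provable ~ln(n) approximation factor,
# which is far better than giving up on the problem entirely.

def greedy_set_cover(universe, subsets):
    """universe: set of elements; subsets: dict name -> set. Returns chosen names."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            break  # remaining elements cannot be covered by any subset
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

universe = set(range(1, 8))
subsets = {"s1": {1, 2, 3}, "s2": {2, 4}, "s3": {3, 4, 5, 6}, "s4": {6, 7}}
print(greedy_set_cover(universe, subsets))  # -> ['s3', 's1', 's4']
```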

14. Your story suggests that to become a data scientist, you have to be willing to set your ego aside. You have to step out of your comfort zone into areas you barely know and start over as a beginner. How good were you at programming and software development?

I don’t think I’m very good at it. In school I was pretty good at writing algorithms and optimization models, but I was never a great team programmer. Even at IBM, although I was part of a four-person development team, we didn’t really have to work that closely together: the software’s architecture was already defined and the interfaces were well understood.

When I was at Zilliant, the company decided to rewrite its pricing engine. The data analysts got together and wrote a spec for the new engine. The work required some programming expertise, and by that point I had been writing code at IBM for several years, so I volunteered to do software development. But pretty soon everyone realized that I had no experience building a real software product from scratch.

I give Zilliant’s manager a lot of credit for what he did then: he paired me with a senior development engineer, John Adair, who became another mentor of mine. Over the next three months, John wrote a new system based on that specification and I tested it; every day I wrote unit tests and ran them against John’s code.

It was one of the most effective learning experiences of my career, because John wrote beautiful code. When I describe the experience to people, they dismiss it because it sounds dull and thankless, and most developers probably don’t enjoy writing unit tests. But if you think of it as spending your days closely studying a master’s work, it’s a lot more fun from that perspective. And I finally learned how to build a software system from scratch.


As development moved into its later stages, I started contributing to the design as well, so I knew both the spec and the software. It was fascinating to see firsthand how code gets written to pass its tests. John and I refactored the system several times, yet when QA tested the whole thing, they found only two bugs. It’s one of the best software projects I’ve ever worked on, and the code was just beautiful.
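Testing someone else’s code against a shared spec looks roughly like the sketch below: the spec pins down the behavior, and the tests encode it so refactorings can be checked. The pricing function and discount rules here are invented for illustration and have nothing to do with Zilliant’s actual engine.

```python
# Hypothetical example of testing against a spec. The imagined spec says:
# "apply a 5% volume discount at 100+ units and 10% at 500+ units,
# but never sell below cost." The function and rules are invented.
import unittest

def quote_price(list_price, cost, quantity):
    discount = 0.10 if quantity >= 500 else 0.05 if quantity >= 100 else 0.0
    return max(list_price * (1 - discount), cost)

class QuotePriceSpec(unittest.TestCase):
    def test_no_discount_below_threshold(self):
        self.assertEqual(quote_price(10.0, 4.0, 50), 10.0)

    def test_volume_discounts(self):
        self.assertAlmostEqual(quote_price(10.0, 4.0, 100), 9.5)
        self.assertAlmostEqual(quote_price(10.0, 4.0, 500), 9.0)

    def test_never_sells_below_cost(self):
        self.assertEqual(quote_price(10.0, 9.8, 500), 9.8)

if __name__ == "__main__":
    unittest.main()
```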

After leaving Zilliant, I had a brief stint at Indeed, working in the search engine group. I was a statistician there; I wrote some code, but mostly I applied my knowledge of statistics. And when I left Indeed to join Google, I was hired as a statistician as well.

Google is full of beautiful code that you can read, use, and learn from. After nine months there, the company changed my title from statistician to software engineer and gave me a promotion. I’ve always felt a little guilty about that, because my code quality was never formally vetted by Google’s internal process.

I’m the kind of person who learns quickly by imitation, whatever the subject. There is so much great code inside Google, and being surrounded by it was an amazing experience. Simply working at Google and seeing how the best people write code made me something like twenty times better as a software engineer. It was an absolutely incredible experience. It was fantastic!

15. Could you describe specifically how that worked at Google? Did you ask the people who wrote the code questions, read their code, and imitate it? How did you learn there?

I don’t know how the rest of the world works, but Google really drilled it into me: I was required to write code the way everyone else at Google did. Readability is a big deal there. No matter what language you write in, your code has to meet certain readability requirements before it can go into Google’s codebase, or it has to be approved by someone who is qualified to review for readability.

To earn readability, you have to write a lot of code in Google’s style, and readability reviews are widely considered a software engineer’s nightmare. I will never forget my readability review for the Sawzall codebase.

I had written some code to analyze the ads logs, looking at the correlation between certain ad metrics and various machine-learned probabilities. It was very simple correlation code, and I submitted it to the Sawzall codebase; the person who reviewed it was Rob Pike.

You may not know Rob Pike: he was at Bell Labs, he worked on Plan 9, he is one of the creators of Go, and he also created the Sawzall language. He is also one of the toughest code reviewers I ever encountered at Google, and I’m sure he would take that as a compliment.

That one time he reviewed my code, it took 26 rounds of review before it was approved, and it was a horrible experience. Things got so bad that I even thought about quitting; there were so, so many harsh comments. But I think that’s the great thing about Google: they forced me to become a better programmer by making me pay attention to all those little details. No pain, no gain.

16. That’s probably one of the beauties of being a data scientist. It’s a cross-disciplinary field, so even if you’re particularly strong in one area, you can still feel like a beginner in the others and humbly ask, “What can I learn from this person?”

I think that’s a very important part of a data scientist’s job. The truth is, learning all these things is a long, winding process. The same software engineer who gives me a brutal code review might later come to me with data analysis questions, because they know I’m a statistician, someone who understands their jargon and can explain things to them.

It’s hard to stay humble, but humility eventually leads to progress. And if one day you become the expert, the person who can talk about their subject fluently, that’s a great place to be.

17. How do you go from a giant company like Google, which is sort of a research institution, to a startup like Cloudera?

There are a lot of things I miss about Google. I miss the people. I miss the food. I miss the toys. There are so many great minds at Google. Our data science team at Cloudera is really an extension of Google: we’ve taken the things we loved about Google and built an open source version of them.

That’s what we do, and it’s the simplest product strategy in the world: take the things you love and refine them.


When I got to Cloudera, there were about 85 people there. It wasn’t really a startup anymore, but it was still small. When I arrived I said, “Hi, I’m the new head of data science. What should I do?” Nobody had any ideas, and neither did I. I didn’t know what they had hired me for.

There were days when I worried about it so much that I felt like I was accomplishing nothing at all. At Google I got about 150 emails a day, all from people who wanted something from me. Here, it felt like I might as well have been retired. That’s the anxiety I was talking about earlier.

So my job at Cloudera was to figure out what I could do. I spent a lot of time talking to customers, and I still do today, advising them on how to build data science teams or helping them work through specific problems.

At the same time, I started working on some of the questions customers had raised, or on interesting and useful things they needed. I was new to Hadoop then, so a big part of my job was learning what Hadoop was and how it worked. I remember writing a model to detect adverse drug reactions; the algorithm had originally been written in a non-MapReduce style, but it was a perfect MapReduce problem.
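Co-occurrence counting is the textbook shape of a “perfect MapReduce problem”: a mapper emits (drug, reaction) pairs and a reducer sums the counts per pair. The toy sketch below runs in a single process purely to show that shape; it is not Josh’s model and not real Hadoop code.

```python
# Toy sketch of why co-occurrence counting fits MapReduce: the map phase emits
# (drug, reaction) keys with a count of 1, and the reduce phase sums per key.
from collections import defaultdict

reports = [
    {"drugs": ["drug_a", "drug_b"], "reactions": ["nausea"]},
    {"drugs": ["drug_a"], "reactions": ["nausea", "headache"]},
    {"drugs": ["drug_b"], "reactions": ["rash"]},
]

def map_phase(report):
    for drug in report["drugs"]:
        for reaction in report["reactions"]:
            yield (drug, reaction), 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

all_pairs = (kv for report in reports for kv in map_phase(report))
for (drug, reaction), n in sorted(reduce_phase(all_pairs).items()):
    print(drug, reaction, n)
```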

It was the first thing I did there that really worked, and I knew it worked because Mike Olson, one of our co-founders, presented my results for five minutes at a conference, and there was a wave of press coverage and Twitter commentary about it afterward.

After that, I worked on problems involving seismic image data, the kind of time-series data that oil and gas companies analyze to find underground oil and gas deposits. I really missed FlumeJava at that point. It would have been an excellent tool for this kind of problem, so I more or less rewrote FlumeJava to solve my problem.

That took me right back to my black-box debugging days at IBM. When I first joined Google, I had used FlumeJava to write data pipelines, so I knew what the API looked like, but I didn’t really know what was underneath the black box, only how to make it work.

Google published a paper on FlumeJava, and that paper was really useful, but you still have to sit down and say to yourself: “OK, I know these APIs behave this way, but I don’t know why they work that way, so let’s take a closer look at what’s going on inside and what makes it work.”
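The core idea in the FlumeJava paper is deferred evaluation: calls like parallelDo only record operations in a dataflow graph, and execution is planned (as one or more MapReduce jobs) when the pipeline is finally run. The toy sketch below illustrates just that laziness; it is not the real FlumeJava or Apache Crunch API.

```python
# Toy illustration of the FlumeJava/Crunch idea: operations are recorded
# lazily as a chain, and nothing executes until run() is called.
# This is a teaching sketch, not the real FlumeJava or Apache Crunch API.

class PCollection:
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def parallel_do(self, fn):
        # Record the operation; do not execute it yet.
        return PCollection(parent=self, fn=fn)

    def run(self):
        # Walk back to the source, then apply each recorded function in order.
        if self._parent is None:
            return list(self._data)
        return [self._fn(x) for x in self._parent.run()]

lines = PCollection(["10", "7", "42"])
doubled = lines.parallel_do(int).parallel_do(lambda n: n * 2)  # nothing runs yet
print(doubled.run())  # [20, 14, 84] -- execution only happens here
```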

Maybe I’m a little self-deprecating about my open source work, but thanks to my time at Google, I’ve let go of any ego about the quality of my code, and I’m happy to put it out there for everyone to see, work on, and improve.

Just because I’m not the best programmer in the world doesn’t mean I can’t contribute something, and the community that has grown up around Crunch is something I’ve always been proud of.

                                                        

Interviews with Data Scientists

By Carl Shan

This book gathers in-depth interviews with 25 world-renowned data scientists, collecting their wisdom, experience, guidance, and advice from many different perspectives. Each interview is a deep conversation that covers the whole journey from starting out as a novice data scientist, to arming yourself with the necessary knowledge, to becoming an effective practitioner. Through these interviews, readers can build a big-picture understanding of data science, get a better feel for the role of the data scientist, and draw on these practitioners’ experience for their own growth and careers.

The book is intended for people who want to become data scientists, people already working in data science, leaders of data science teams, entrepreneurs and business people, and general readers interested in data.

                                              
