By data science, I mean using data and software to solve a given industry problem1. By dispatch, I mean it in the software engineering sense: the task of choosing how to process inputs, depending on those inputs. By good data science is mostly dispatch, I mean most of what explains a job well done is the choice of tool. I suspect this checks out with many of the data scientists reading this, but I’ll make the argument regardless.
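For readers less familiar with the software engineering sense of dispatch, here is a minimal sketch using Python's standard `functools.singledispatch`. The `summarize` function and its handlers are hypothetical names chosen for illustration; the point is only that the input decides which routine runs.

```python
from functools import singledispatch


@singledispatch
def summarize(data):
    """Fallback: runs when no handler is registered for the input's type."""
    raise TypeError(f"No handler registered for {type(data).__name__}")


@summarize.register
def _(data: list):
    # A non-empty list of numbers is summarized by its mean.
    return sum(data) / len(data)


@summarize.register
def _(data: str):
    # Text is summarized by its whitespace-separated word count.
    return len(data.split())


@summarize.register
def _(data: dict):
    # A mapping is summarized by its sorted keys.
    return sorted(data)
```

Calling `summarize([1, 2, 3])`, `summarize("a b c")` and `summarize({"b": 1, "a": 2})` each runs a different handler without the caller choosing one. The argument in this post is that a data scientist plays the role of `summarize`, with problems as inputs and tools as handlers.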
More importantly, accepting that good data science is mostly dispatch should change the strategy for improving as a data scientist: from developing impressive dexterity with a popular toolset to improving the dispatch mechanism itself. This isn’t true across the board (e.g. deep knowledge is needed for specializing and tool dexterity is needed for speed), so I’ll make the argument more carefully later. But first:
Arguing it’s mostly dispatch
Consider a hypothetical. You’re beginning a 20 year career in data science and an omniscient oracle offers you a tremendous leg up on the competition. For every data science project you face, you may ask the oracle the same question once. This is a single question that’ll be answered perfectly and once per project. Also, the answer needs to be short; asking for the binary readout of a perfect solution isn’t in the spirit of the game. In fact, I realize this game isn’t well defined2, but it’s that game spirit I’d like you to focus on. So, which question would you repeatedly ask? I’d ask:
What tool should I use?
Between our starting point of ignorance and ending point of a well informed solution, this question’s answer moves us an impressive distance, right to the solution’s doorstep where documentation, demos and experts are available. To illustrate, here’s my paltry attempt to generate answers the oracle might provide for different problems:
- Excel: The problem can be solved fairly simply, with all intermediate values visible.
- MOSEK: The problem can be framed as an optimization problem, likely a large scale linear or convex one.
- LINDO: This also suggests the problem can be framed as an optimization problem, but one that can be coded succinctly.
- AIMMS: A component of the problem is optimization, but the surrounding business operations can be managed with the AIMMS GUI.
- Stan: The problem isn’t very large and properly quantifying uncertainty and statistical relationships is essential.
- PyTorch and TorchVision: It’s a computer vision problem where you should probably use a pre-trained model.
- XGBoost: It’s likely a low signal-to-noise and tabular prediction problem.
- Optimizely: The problem can be solved and managed by running an A/B test via this platform.
- huggingface/transformers: The oracle is on the AI-hype train and recognizes your problem as one of natural language.
The choice across answers like this - that’s the hard part (and that’s why I’m not too proud of this list3). It requires a huge volume of knowledge, and no single person has it. So comprehensive, rigorous guidance for this question isn’t provided anywhere. A book advertised to do so would be met with laughter and dismissal by the same instinct that predicted the General Problem Solver would fail. No one can query this database - it’s too damn big!
Consequently, our choice of tool is informed only by our biased and myopic experience. This problem is captured by the adage:
“To a hammer, everything looks like a nail.”
This makes the value of experience clear. Someone who’s worked many technical problems is closer to a toolshed than a hammer; they prove their value in their first move, their reach for the right machinery. For example, consider the well known data science champ Bojan Tunguz, someone who’s performed in the top ranks for hundreds of modeling challenges, almost since Kaggle’s inception. To understate it, he knows how to approach a diversity of problems. And, to the argument here, he’s memed his experience to a literal one word answer: XGBoost.
This suggests that “what tool should I use?” is not only the right question, but one that often has a single answer. My read of this is, ‘use XGBoost and you’ll be fine.’
But this appears to cut against my argument; if the question always yields the same answer, why ask it? Just always use XGBoost and ask a different question4. To this I say, Kaggle is only a slice of data science problems as I’ve defined them: industry problems addressed with data and software. They are clean, well-defined, mostly supervised prediction tasks - something I’m certain is a small subset of my definition.
If it’s mostly dispatch, focus on dispatch
If we accept the importance of smart tool-dispatch, we should give it its due attention. At the moment, we have no way to rigorously address the fantastically general case, but at the least, I’ll make some recommendations:
- Explicitly ask the which-tool question and give it a significant portion of the project’s time and effort: Not coding or writing feels like time wasted, but rushing into the wrong solution is worse than standing still. To throw out a number, if a data scientist is inexperienced and the problem is wholly new, I’d speculate it’s reasonable to invest 25% of the project’s time into researching tools for similar problems. Finding similar problems may not be possible; in this case, I believe it’s still time well spent. It’s worth it for the mere, not unlikely, possibility.
- Prioritize learning by what’s available in code5: All else equal, learning a technique that can only be applied after you code it from scratch is much less valuable than one paired with code, since the former brings a one-person development phase. In fact, a seasoned data scientist experiences from-scratch-coding as a pungent fail of the smell test, assuming it’s not for pedagogical purposes. This suggestion is already recognized in the popularity of paperswithcode.com and GitHub’s Explore page for repositories.
- Make a habit of surveying tools: Personally, I do this by regularly checking GitHub. This has been helpful, but I suspect it’s not enough. There is a universe of paid, closed source software that may frequently contain the best tool, so we should investigate this blind spot. As one approach, some modern textbooks (e.g. Operations Research) begin or end with a review of the software most utilized for the field, much of which is closed source but not unreasonably priced.
- Survey experts: This is so obviously useful, I hesitated to include it. But if we view it from the dispatch perspective, we’re nudged to be targeted in our conversations. An open ended discussion with an expert may produce a lot of whiteboarding (not something I’d disparage), but a which-tool question might produce a URL, an easier place to start coding from.
- Aim to be interdisciplinary: Accept that you’ve been a hammer, that not everything is a nail and that it’s time to be a toolshed. In line with the conventional wisdom, this means it’s useful to learn a variety of approaches to the same problem. If you’ve used fancy machine learning for every problem you’ve faced, you’re being a young hammer.
- Characterize problems and tools by essential features that resolve the tool-choice: Characterize problems in terms of linear-or-not, optimization-or-not, signal-to-noise ratio, structured-or-unstructured, categorical-or-regression, high-or-low dimensionality, small-or-big data, dense-or-sparse or whatever else may be essential to explain which tools best match which problems. Characterize tools in terms of time/space-complexity, old-or-modern, software requirements, data-size requirements, GPU-powered-or-not, tree-based-or-not or whatever else may explain which tools apply where. Developing this matching heuristic is, in my view, the best way to answer the which-tool question without an oracle.
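The matching heuristic in the last recommendation can be sketched in code. This is only an illustration under loud assumptions: the feature names (`task`, `data`, `size` and so on) and the tool profiles are hypothetical, loosely adapted from the example oracle answers earlier in this post, and a real matching would need far more features and tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    name: str
    suits: dict  # feature name -> value the tool is suited to (hypothetical)


# Hypothetical profiles for illustration only; not an authoritative description
# of what each tool can or cannot do.
TOOLS = [
    ToolProfile("XGBoost", {"task": "prediction", "data": "tabular",
                            "signal_to_noise": "low"}),
    ToolProfile("Stan", {"task": "inference", "size": "small",
                         "needs_uncertainty": True}),
    ToolProfile("MOSEK", {"task": "optimization", "size": "large",
                          "convex": True}),
    ToolProfile("PyTorch + TorchVision", {"task": "prediction",
                                          "data": "images"}),
]


def rank_tools(problem, tools=TOOLS):
    """Rank tools by the number of problem features they match.

    A tool is dropped if any feature the problem specifies conflicts with
    the tool's profile. Features the problem leaves out are ignored.
    """
    ranked = []
    for tool in tools:
        # Only compare features that the problem actually specifies.
        known = {k: v for k, v in tool.suits.items() if k in problem}
        if any(problem[k] != v for k, v in known.items()):
            continue  # at least one conflict: this tool does not apply
        if known:
            ranked.append((tool.name, len(known)))
    # Most matches first; ties broken alphabetically by tool name.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

For example, `rank_tools({"task": "prediction", "data": "tabular", "signal_to_noise": "low"})` keeps only XGBoost, while the under-specified `rank_tools({"task": "prediction"})` returns both prediction tools tied. That tie is a hint that the problem needs to be characterized further before a tool can be chosen.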
Caveats and Counterarguments
My definition of data science as ‘using data and software to solve a given industry problem’ is admittedly crafted for this argument. Some challenges I anticipate:
Isn’t the question of ‘What problem should I solve?’ more important? Yes, but recognizing this doesn’t give us much else to recommend. If I rephrased my argument as “the second most important question is which-tool”, not much else changes. In other words, I don’t see how to make concrete recommendations for how to ask better questions.
Data science is actually mostly communication (or something else interpersonal). Measuring importance by time-spent might give an answer like this. Regardless, communication is a major component in most jobs; acknowledging it doesn’t help us provide a recommendation specific to data science. Still, I can see this justifying an asterisk next to ‘Mostly’ in the title.
Isn’t the best tool dependent on the data scientist? As the question is framed, yes. It may be better to rephrase it as “What tool should a data scientist who’s moderately proficient with all tools use?” This data scientist doesn’t exist, but the answer is useful to any data scientist; it gives the perfect reason to become proficient in something new.
Should we not specialize? No, multiple things can be important at once. Expertise in a narrow subject and strong familiarity with a small toolset gives speed and performance. By all means, acquire these; they are also essential. This post is merely to apply a counteracting force to the bias this creates.
Disagree with me? Think this argument is missing something important? Email me at firstname.lastname@example.org and I may include it here with your name and my response. If I do post it, I’ll make sure we mutually approve it.
I’m excluding academic problems because those may reference tools themselves, which defeats my argument. For example, why do tree-based models still outperform deep learning on tabular data? ↩
What does it mean for an answer to be ‘short’? What defines an allowable question? What about a question that mostly has short answers but occasionally has long ones? What defines the ‘best’ solution and how is the answer determined with respect to it? Because of this, the game isn’t playable and the argument isn’t rigorous. That’s why I’ve placed this post in the “Opinions and Speculations” section. ↩
This is certainly much less true for research. To emphasize, my argument is only for industry applications. ↩