“Driving like everyone else is drunk.”
Defensive driving means more than passing a permit test. It does prescribe specific methods, like the “two seconds ahead” rule or slowing down during rainstorms. But it really emphasizes general principles: how to govern your own actions, and how to account for the actions of other vehicles.
“Expect the unexpected in the road,” “Don’t expect other drivers to behave like you do,” and “Never trust others’ (lack of) turning signals” all apply generally.
In past posts I wrote about Data Literacy and overcoming assumptions about your data. You could follow checklists with specific steps to perform good data analysis, but Data Literacy also means a certain way of safely handling data.
As the newly minted PhD Kevin Z Hu recently put it, getting from Point A to B is unremarkable, but a careless slip-up can mean a calamitous crash. I believe certain general approaches can convey you there consistently.
Check Your Mirrors
You should trust a moderate Data Skeptic. They check their assumptions: do I have duplicates? What about blanks, or nulls? It pays to glance before changing lanes.
Examples of specific methods include: testing the uniqueness of the primary key (“what entity is this table about?”), testing data relationships (One-to-one? Many-to-one?) to uncover duplicates, and really understanding the GROUP BY statement, if applicable.
It bears repeating: What tables or data sources are behind this metric someone passed you? Maybe run a quick SELECT * to check out the shape of the data.
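Here is a rough sketch of what a glance in the mirrors might look like in practice. It uses Python's built-in sqlite3 module and a made-up `orders` table; the table, its columns, and the helper names `duplicate_keys` and `null_counts` are all hypothetical, chosen only to illustrate the duplicate and null checks described above.

```python
import sqlite3

# Hypothetical toy table: "orders" is supposed to hold one row per order_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES
        (1, 10, 25.0),
        (2, 11, NULL),
        (2, 11, 40.0),
        (3, NULL, 15.0);
""")

def duplicate_keys(conn, table, key):
    """Return (key value, row count) pairs for keys that appear more than once.

    An empty list means the column is unique, as a primary key should be.
    Table and column names are interpolated directly, so pass trusted names only.
    """
    sql = (
        f"SELECT {key}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()

def null_counts(conn, table, columns):
    """Return a dict mapping each column name to its number of NULL values."""
    return {
        col: conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        for col in columns
    }

dupes = duplicate_keys(conn, "orders", "order_id")
nulls = null_counts(conn, "orders", ["order_id", "customer_id", "amount"])
print(dupes)  # order_id 2 appears twice
print(nulls)  # customer_id and amount each have one NULL
```

A non-empty `dupes` list is the signal to stop and ask what entity the table is really about before any GROUP BY or join builds on top of it.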
Mind Your Blindspots
Analysts, consider: Do you first ask whether your team even collects the data relevant to your business question? Does that metric truly measure what it claims to measure, or do you just want it to? Confirmation bias is an alluring drug.
It’s crucial when operating out of your comfort zone (say, a new data domain) to be wary of what lies beyond the mirrors, and do a cursory sanity check. Countless exploratory analyses have leapt to their deaths via “wishful data thinking” – wherein someone takes a column with a familiar name to mean what they hope it means.
The earlier that assumption gets made, the more damage it causes.
Other Drivers Might Behave Unexpectedly
Analysts live a catch-22. Sometimes, their tireless research can’t convince management of the next great opportunity, or that a five-alarm operations fire lurks around the corner.
But they also know that any number, once exposed, can go viral and live a life of its own in the minds of their bosses, or worse – clients.
So, naming things well matters. Documentation matters! Concise blurbs that describe your fields and what your analysis should or should not be used for can go a long way to stem future misuse. And be careful what you share and how you share it.
I know a data scientist who wisely put big, fluorescent watermarks (“WARNING! DO NOT USE! NUMBERS NOT REAL!”) all over their preliminary results for this reason.
Everyone knows someone with a very messy car. “Just move that trash out of the way,” they’ll say, as you brush aside a bag of extra napkins and fight for legroom with their guitar case.
I’m not judging that (glass houses and all), but you wouldn’t leave your rental car a mess.
So, if you’re sharing queries, prefix and alias your columns! “ID” is not a descriptive field name, and you will thank yourself for commenting your code – even briefly.
If I write a query, the next person to use it will probably be future me, and my memory’s just not that good. Caitlin Hudon has a nifty blog post on this. Be kind to others; be kind to your future self.
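As a small illustration of a tidy shared query, here is a sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables and their contents are invented for the example; the point is only that both tables have an `id` column, so the query prefixes each column with a table alias, gives every output field a descriptive name, and opens with a one-line comment.

```python
import sqlite3

# Hypothetical tables: both have a column called "id", which is exactly
# the kind of ambiguity that prefixes and aliases clear up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace');
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0);
""")

QUERY = """
-- Total order amount per customer. One row per customer with at least one order.
SELECT
    c.id          AS customer_id,
    c.name        AS customer_name,
    COUNT(o.id)   AS order_count,
    SUM(o.amount) AS total_amount
FROM customers AS c
JOIN orders AS o
    ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
"""

cursor = conn.execute(QUERY)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```

Whoever opens the result, future me included, sees `customer_id` and `order_count` rather than two columns both named `id`.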
Don’t Ignore the Obvious
Lug nuts fall off? Tail light out? Got wind that the business is about to swap an internal software system that also feeds reports that run the business? Consider the downstream implications and act. Inform the data team!
The last thing you want is to ignore the check engine light for weeks, only to discover that your data went stale a month ago.
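One way to give your data its own check engine light is a simple freshness check. The sketch below uses Python's built-in sqlite3 module and assumes a hypothetical `daily_sales` table with a `loaded_at` timestamp stored as ISO 8601 text; the table, the column, the one-day threshold, and the helper names `latest_load` and `is_stale` are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical table: each row records when it was loaded, as ISO 8601 text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, loaded_at TEXT);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', '2024-01-02T06:00:00'),
        ('2024-01-02', '2024-01-03T06:00:00');
""")

def latest_load(conn, table, ts_column):
    """Return the most recent load time as a datetime, or None if the table is empty.

    ISO 8601 strings in one consistent format sort chronologically, so MAX()
    on the text column finds the latest timestamp.
    """
    value = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]
    return datetime.fromisoformat(value) if value is not None else None

def is_stale(latest, now, max_age):
    """Treat data as stale if nothing was ever loaded or the last load is older than max_age."""
    return latest is None or (now - latest) > max_age

latest = latest_load(conn, "daily_sales", "loaded_at")

# "now" is passed in explicitly so the example gives the same answer on any day.
fresh_check = is_stale(latest, datetime(2024, 1, 3, 12, 0), timedelta(days=1))
stale_check = is_stale(latest, datetime(2024, 2, 3, 12, 0), timedelta(days=1))
print(fresh_check, stale_check)
```

In real use you would pass `datetime.now()` and pick a threshold that matches how often the source is supposed to refresh.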
My own driver’s ed teacher pitched her roomful of 16-year-olds with the snappy adage, “driving defensively prevents carelessness from causing carlessness.” I’m not as clever as her, so I’ll leave you with this:
Assume your data is drunk, and you’ll get from point A to point B safely.