Thoughts on data in the workplace

Starting out on the server? Start here. — August 8, 2019

Target audiences:
Analysts let loose on the server for the first time.
Students developing operational habits.

Maybe you’re an analyst who’s been given credentials to the server where the data scientists run their scripts. Or maybe it’s the first day of classes, and you’ve been given credentials to the homework server.

Ok, now what?

If this is your first time using the command line, it can feel meaningless or even disquieting to look at that blinking cursor without context. Here’s some:

Get Situational Awareness

I’ve written elsewhere on this blog about how important situational awareness is when dealing with data. Let’s apply those same principles:


Odds are, you’re on someone else’s computer, and it’s a Linux machine. Linux is an open-source operating system. You can think of it as a third OS, a cousin to Microsoft Windows and macOS.

Think of a server as a large, often shared, computer. Maybe it’s “in the cloud.” You’re probably accessing it remotely, not walking up to an oversized computer and desk chair.

That’s me! And I’m on a computer named “DESKTOP-JCGIQGF”, apparently.
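If your prompt doesn't spell this out, you can ask directly. A minimal sketch, assuming a standard POSIX shell; the commands are one reasonable choice, not necessarily the ones behind the output described above:

```shell
# Who am I logged in as? (id -un prints the user name; fall back to the numeric id)
me="$(id -un 2>/dev/null || id -u)"

# What machine is this? (uname -n prints the machine's network name, i.e. its hostname)
host="$(uname -n)"

echo "I am ${me} on ${host}"
```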

But where am I? Linux, like Windows or macOS, organizes files into folders (also called directories), like Desktop or your Downloads folder. When you first log in, you land by default in your home directory. What’s that? To find out, enter pwd:

The terminal has returned the result of your command, and waits for more.

pwd returns where we are right now. cd lets you move around:

I happen to know I want to look at a folder at /mnt/c/, called “comp15”

and then pwd confirms this as well:
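Here is roughly what that sequence looks like typed out. This is a sketch: the scratch folder is hypothetical and created only so the example runs anywhere; on your server you'd cd into a path that already exists (mine was under /mnt/c/).

```shell
# Hypothetical scratch location, for illustration only.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/comp15"

pwd                       # where am I right now?
cd "${workdir}/comp15"    # move into the folder
pwd                       # confirm the move: prints the new location
here="$(pwd)"

cd - >/dev/null           # "cd -" jumps back to the previous directory
```

Plain cd with no argument takes you back to your home directory, which is handy when you get lost.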


We know where we are, but we can’t see anything. What’s here? Enter ls:

ls lists the files and folders. Here I’ve got a bunch of folders.

Check out man ls for helpful flags you can add. ls -alh returns something quite different! Try it yourself.
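To see the difference side by side, here's a small sketch. The folder and file names are made up for illustration; point ls at any directory you like.

```shell
# Hypothetical scratch folder with a few entries, for illustration only.
demo="$(mktemp -d)"
mkdir "${demo}/projects"
printf 'hello\n' > "${demo}/notes.txt"
touch "${demo}/.hidden_config"

ls "${demo}"         # names only; hidden "dotfiles" are skipped
ls -alh "${demo}"    # -a: include hidden entries, -l: long format, -h: human-readable sizes

# Count entries each way to see what -a adds (., .., and the dotfile).
plain_count="$(ls "${demo}" | wc -l | tr -d ' ')"
all_count="$(ls -a "${demo}" | wc -l | tr -d ' ')"
echo "plain: ${plain_count}, with -a: ${all_count}"
```

The long format adds permissions, owner, size, and modification time, which is often the fastest way to spot whose file something is and when it last changed.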

Another neat command you can use to investigate your surroundings is tree:

A snippet of the results: the contents of subfolders that start with p
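One caveat: tree isn't installed everywhere. A hedged sketch that uses it when available and falls back to find otherwise (the folder names are hypothetical):

```shell
# Hypothetical nested folders, for illustration only.
root="$(mktemp -d)"
mkdir -p "${root}/papers/drafts" "${root}/plots"
touch "${root}/papers/drafts/v1.txt" "${root}/plots/fig1.png"

if command -v tree >/dev/null 2>&1; then
  tree "${root}"            # indented view; add -L 2 to limit the depth shown
else
  find "${root}" | sort     # fallback: one path per line
fi

# Total number of entries under the root (the root itself included).
entries="$(find "${root}" | wc -l | tr -d ' ')"
```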


Maybe you’re investigating some logs or processes. What time is it? I mean, what time is it on this computer? Timezones and time discrepancies can be a nightmare. Simple – enter date:

What time the computer thinks it is right now.

This can be super important for communicating with others about things happening on the server. Observe the timezone. Shared computers, like AWS EC2 servers, often default to UTC instead of the local timezone.
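A quick way to compare the server's view of time with UTC, as a minimal sketch using standard date options:

```shell
date                        # local time as this computer sees it, timezone included
date -u                     # the same instant expressed in UTC

tz="$(date +%Z)"            # just the timezone abbreviation this machine uses
utc_tz="$(date -u +%Z)"     # the abbreviation when UTC is forced
echo "This machine reports timezone: ${tz}"
```

If the two date lines match, the server is running on UTC, and any timestamps you quote to colleagues should say so.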

If you’re an analyst who cares about logs on a shared server, there’s a good chance you’re in the unenviable position of investigating crontab runs. I won’t dive deep into that here – check out this intro. crontab -l will show what cron schedules exist, if any.
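A small sketch of that check. crontab -l exits with an error when you have no schedule (or when cron isn't installed), so this version falls back to a message; the sample entry in the comment is hypothetical.

```shell
# Each crontab entry has five time fields, then the command:
#   minute hour day-of-month month day-of-week command
# e.g. (hypothetical)  0 6 * * 1 /path/to/weekly_report.sh   -> Mondays at 06:00
schedule="$(crontab -l 2>/dev/null || true)"

report="${schedule:-no crontab entries for this user}"
echo "${report}"
```

Cron typically evaluates those times in the server's own timezone, which is one more reason to run date first.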


If you’re on a shared computer, it can be tough to tell who’s doing what right now. That gets into more advanced topics, but htop can provide a live view of what’s running on that computer:

Not much is happening on my computer right now – just me and the OS hanging out.
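htop is interactive and isn't installed on every machine. As a rough, non-interactive stand-in (my assumption, not a replacement for htop's live view), ps can summarize how many processes each user is running:

```shell
if command -v ps >/dev/null 2>&1; then
  # One line per user: process count, then user name, busiest first.
  per_user="$(ps -e -o user= 2>/dev/null | sort | uniq -c | sort -rn || true)"
fi

report="${per_user:-could not list processes on this machine}"
echo "${report}"
```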


Congrats! You’re on a Linux server. You can get situational awareness quickly by finding out Where (pwd, cd), What (ls, tree), When (date, crontab), and even a little bit of Who (htop).

They Teach Data Safety in Drivers Ed — May 6, 2019

“Driving like everyone else is drunk.”

Defensive driving means more than passing a permit test. Defensive driving does prescribe specific methods, like the “two seconds ahead” rule, or slowing down during rainstorms. But it really emphasizes following general principles regarding your own actions, and how to account for actions of other vehicles.

“Expect the unexpected in the road,” “Don’t expect other drivers to behave like you do,” and “Never trust others’ (lack of) turning signals” all apply generally.

Are your metrics leading you in the right direction?
Photo by Gareth Harrison

In past posts I wrote about Data Literacy and overcoming assumptions about your data. You could follow checklists with specific steps to perform good data analysis, but Data Literacy also means a certain way of safely handling data.

As the newly minted PhD Kevin Z Hu recently put it, getting from Point A to B is unremarkable, but a careless slip-up can mean a calamitous crash. I believe certain general approaches can convey you there consistently.

Check Your Mirrors

You should trust a moderate Data Skeptic. They check their assumptions: do I have duplicates? What about blanks, or nulls? It pays to glance before changing lanes.

Examples of specific methods include: testing the uniqueness of the primary key (“what entity is this table about?”), testing data relationships (One-to-one? Many-to-one?) to uncover duplicates, and really understanding the GROUP BY statement, if applicable.

It bears repeating: What tables or data sources are behind this metric someone passed you? Maybe select * to check out the shape of the data.
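The checks above are phrased with SQL tables in mind. As a file-based sketch of the same mirror-check, assuming a small hypothetical CSV export where customer_id is supposed to be unique:

```shell
# Hypothetical export, for illustration only: one row per customer is the
# assumption we want to test.
data="$(mktemp)"
cat > "${data}" <<'EOF'
customer_id,email
101,ana@example.com
102,bo@example.com
102,bo@example.com
103,
EOF

# Key values that appear more than once (skip the header row first).
dupes="$(tail -n +2 "${data}" | cut -d, -f1 | sort | uniq -d)"

# Rows whose second column (email) is blank.
blanks="$(tail -n +2 "${data}" | awk -F, '$2 == "" { n++ } END { print n + 0 }')"

echo "duplicate keys: ${dupes:-none}"
echo "blank emails: ${blanks}"
```

In SQL, the usual equivalent of the duplicate check is grouping by the key and keeping groups with a count above one.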

Mind Your Blindspots

Analysts, consider: Do you first ask whether your team even collects the data relevant to your business question? Does that metric truly measure what it claims to measure, or do you just want it to? Confirmation bias is an alluring drug.

It’s crucial when operating out of your comfort zone (say, a new data domain) to be wary of what lies beyond the mirrors, and do a cursory sanity check. Countless exploratory analyses have leapt to their deaths via “wishful data thinking” – wherein someone takes a column with a familiar name to mean what they hope it means.

The earlier that assumption gets made, the more damage it causes.

Other Drivers Might Behave Unexpectedly

Some bad assertions just won’t be stopped.
Photo by Conor Samuel

Analysts live a catch-22. Sometimes, their tireless research can’t convince management of the next great opportunity, or that a five-alarm operations fire lurks around the corner.

But they also know that any number, once exposed, can go viral and live a life of its own in the minds of their bosses, or worse – clients.

So, naming things well matters. Documentation matters! Concise blurbs that describe your fields and what your analysis should or should not be used for can go a long way to stem future misuse. And be careful what you share and how you share it.

I know a data scientist who wisely put big, fluorescent watermarks (“WARNING! DO NOT USE! NUMBERS NOT REAL!”) all over their preliminary results for this reason.

Hygiene Helps

Everyone knows someone with a very messy car. “Just move that trash out of the way,” they’ll say, as you brush aside a bag of extra napkins and fight for legroom with their guitar case.

“Oh, just toss ’em in the back.”
Photo by Per Lööv

I’m not judging that (glass houses and all), but you wouldn’t leave your rental car a mess.

So, if you’re sharing queries, prefix and alias your columns! “ID” is not a descriptive field name, and you will thank yourself for commenting your code – even briefly.

If I write a query, the next person to use it will probably be future me, and my memory’s just not that good. Caitlin Hudon has a nifty blog post on this. Be kind to others; be kind to your future self.

Don’t Ignore the Obvious

Lugnuts fall off? Tail light out? Got wind that the business is about to swap an internal software system that also feeds reports that run the business? Consider the downstream implications and act. Inform the data team!

The last thing you want is to ignore the check engine light for weeks, only to discover that your data went stale a month ago.

Photo by Marc Schäfer

My own driver’s ed teacher pitched her roomful of 16-year-olds with the snappy adage, “driving defensively prevents carelessness from causing carlessness.” I’m not as clever as her, so I’ll leave you with this:

Assume your data is drunk, and you’ll get from point A to point B safely.

Data Managers of One — February 11, 2019

We’re all typists, now.

If you work in the Knowledge Economy, you know: We are all our own secretary. The full-time role that some might mistakenly label the “modern secretary” – Executive Assistants – has become more skilled, more specialized, and an industry unto itself. Most people couldn’t hack it as a high-performing EA.

But calendar management, running a meeting, document sharing, and even the simplest Excel usage: these are basic workplace skills expected of all knowledge workers.

Photo by Samuel Zeller

That baseline has a corresponding, deeper set of valuable skills. SvN (authors of “It Doesn’t Have to Be Crazy at Work”) has written repeatedly on how the most effective employees are also Managers of One. Such employees clearly execute more than just low-friction calendar management and document sharing. We treat those skills today as basic components of workplace literacy.

Instead, a Manager of One excels at task identification, goal-setting, time management, delegation, and escalation. Per SvN:

When you find these people, it frees up the rest of your team to work more and manage less.

SvN, “Hire Managers of One”

Data Literacy as Workplace Literacy

I argue that Data Literacy carves its own niche with a “deep set” of skills beyond the expected default. Just like the Manager of One increases their value and output with discipline and self-management, a Data Manager of One can do more than crack open a spreadsheet or read a chart.

A Data Manager of One can reason about a set of data, understand its sources, ferret out outliers, and have instincts for how the business changes the data, and how data can change the business. They can smell funky data from a mile off. And most invaluably, they ask precise, pointed questions about the data, and know when and how to delegate and ask for help.

Much has been written on how effective analysts and data architects dive deep with the business experts. There’s no data engineering or analysis without business context. But certainly the opposite rings true, and I have loved seeing it in my colleagues: growth toward facility with data (manipulating, interpreting, questioning) is a healthy career move.

Over time, more elements of Data Literacy will continue to creep into the skill set of the “typical” Knowledge Worker, increasing the demand for skills… and the demand for data. At any organization, this could present itself as either a virtuous cycle of learning and discovery, or a data demand spiral, spilling confusion and report breadlines everywhere.

Finance, already the domain of Excel wizards and set-thinkers, demonstrates this. At time of writing, there are over 11,000 unfilled postings on Indeed for “Financial Analyst + SQL”.

This is not to say all subject experts must learn SQL in particular, or that every team needs its own data engineer. There must be a team tasked with preparing and presenting data for the rest of the company, such that every end user can be more efficient and effective. Philosophies on how to best achieve this differ, but all agree that organizations don’t need everyone repeating the same analyses over and over.

The message for organizations and employees alike is that hiring and training for Data Managers of One speeds up the overall team, and, if supported with the right infrastructure, cultural norms, and documentation, allows you to work more and manage less.

The attendant message for analysts and data engineers remains: never cordon yourself off from the rest of the org. Business context is priceless.


Nearly everyone could benefit from increased Data Literacy. There are plenty of resources online from BI vendors for generalists to understand general workplace data literacy and, in turn, use their products.

(I don’t necessarily endorse those)

I do endorse SQLzoo as the simplest, best introduction to SQL. It’s where I got started.

The First Query — February 2, 2019

Nothin’ beats a good select stah.

J Falletti
Photo by Donald Giannatti

New job? New data set? First time using SQL? First time seeing that file? Reviewing someone else’s code/output?

When it comes to data analysis of any kind, people of all experience levels ought to begin with getting situational awareness:


This little query, a plain select * against whatever table is in front of you, opens a thousand doors. I run it every day.

Highly Successful Data People have a habit: the instinct to first grasp, even in a shallow way, what the data looks like.

It bears repeating! Make it a reflex. Make it a habit.
Ask: what is the shape of the data? And follow your curiosity from there.

***Please query responsibly: use LIMIT 10 or TOP 10. Add your joins one at a time. Don’t hold your assumptions tightly. Have fun!
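For the "first time seeing that file" case, the same reflex works from the command line. A minimal sketch, assuming a hypothetical comma-separated file:

```shell
# Hypothetical file, for illustration only.
file="$(mktemp)"
printf 'order_id,amount\n1,9.99\n2,14.50\n3,3.25\n' > "${file}"

head -n 10 "${file}"                                      # peek at the first rows only
rows="$(wc -l < "${file}" | tr -d ' ')"                   # total lines, header included
cols="$(head -n 1 "${file}" | awk -F, '{ print NF }')"    # comma-separated columns in the header
echo "${rows} lines, ${cols} columns"
```

Peeking at ten lines before opening the whole thing is the file-sized version of querying responsibly.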