5 lessons learned from building a data pipeline
There is no clear winner in the debate of whether to buy or build software.
The arguments for both sides are plentiful, and the decision is one that can have a huge impact on an organization’s resources, operations, and funds. In fact, customers frequently ask my team: Why should I buy instead of build?
Through these conversations, we discovered a need for a fully managed data pipeline. And after much consideration, we decided to build one that could serve the needs of a number of media companies.
An analytics data pipeline is infrastructure—plumbing, really—for reliably capturing raw analytics data from websites, mobile apps, and devices. The goal of an ideal data pipeline is to fade into the background: to allow arbitrary data capture, streaming access, and infinite storage, but otherwise to “just work” efficiently.
So whether you decide to build or buy, you can benefit from the lessons we learned while building our own real-time pipeline. At least that way you’ll be armed with what you need to meet the challenge.
Lesson #1: With analytics data, scale matters
The last thing you want is to be paged at 2 a.m. on a Monday night because your data pipeline went down. When building, remember that you are potentially dealing with millions of events per day, and you need a system that can handle this massive amount of data.
It’s easy to get the proof of concept working, with a single data center, when data is flowing on the “happy path” of manicured test data sets. But what about when you have a major traffic spike? What about when buffers fill, your server’s CPU spikes, or your disk’s I/O fluctuates? Have you thought about long-term storage costs or compression? What about when customers send you malformed data? In all of these cases, and many others, only a scaled operation will be able to stay up in the face of adversity.
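Handling malformed data, for instance, starts with strict validation at the edge. Here is a minimal sketch of the idea; the field names and size limit are hypothetical, and a real pipeline would tune them to its own traffic and likely dead-letter rejected payloads rather than silently drop them:

```python
import json

# Hypothetical limits -- tune these to your own traffic.
MAX_EVENT_BYTES = 16 * 1024
REQUIRED_FIELDS = {"event_type", "timestamp", "visitor_id"}

def validate_event(raw: bytes):
    """Return a parsed event dict, or None if the payload is malformed.

    Bad data should be rejected at the edge, never allowed to crash
    a downstream consumer.
    """
    if len(raw) > MAX_EVENT_BYTES:
        return None
    try:
        event = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return None
    if not isinstance(event, dict):
        return None
    if not REQUIRED_FIELDS.issubset(event):
        return None
    return event
```

A well-formed event passes; truncated JSON, non-JSON bytes, and events missing required fields all come back as None instead of raising deep inside the pipeline.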
Lesson #2: One server is never enough
Funneling data from a range of places into a single server simplifies the process, but if you’re collecting data from users across the web and sending all of that information to one place, you risk severely impacting the user experience.
What if your visitors are in Europe, Asia, or Australia? What if entire regions of your cloud hosting provider go down? Can you afford for data collection to be offline? Have you thought about retries, replays, and backup? Doing it the “simple way” may be easy now, but it could cause trouble down the road.
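Retries are one of those details that look optional until a region goes down. Here is a minimal sketch of retrying with exponential backoff and jitter; `send` stands in for whatever transport you use (an HTTP POST, a Kafka produce, etc.), and the attempt counts and delays are illustrative:

```python
import random
import time

def send_with_retry(event, send, max_attempts=5, base_delay=0.5):
    """Try to deliver an event, backing off exponentially between attempts.

    `send` should raise on failure. Returns True on success, False if all
    attempts failed -- at which point the event should be spooled to local
    disk for later replay, not dropped.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            # Full jitter keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return False
```

The jittered delay matters at scale: if every client retries on the same fixed schedule, a brief outage turns into a self-inflicted thundering herd the moment the service comes back.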
Lesson #3: Rollups make things cheaper, but at a great expense later
When people think of analyzing user data, they tend to count how many times people visit a page. An example of this is tracking how many times John Smith visits your site in a month. Taking this type of approach means you’re missing out on a lot of valuable information. It’s a question of raw data storage versus rollup storage.
Rollups are always cheaper and easier, at first. Yes, you could keep a tally of the number of visitors who log in when they visit your site. But you may learn more valuable information by following the behavior of the 10% of people who rarely log in, or the 10% who log in daily. Who are they, and how are they using your site? You can only answer those questions with raw data.
Tracking every action in a raw way allows for the development of insights that tell even more important stories. Trust me, your future data scientists will curse the day you decided to throw away valuable user data in the interest of efficiency. A hosted data pipeline can let you take a “capture everything” approach; this is the approach used by the pros at Google and Facebook, and it’s the way your team will be able to gain insight from the data.
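The asymmetry is easy to demonstrate. With raw events (the event shape below is a toy stand-in), any rollup can be derived on demand, but a rollup can never be turned back into answers to new questions:

```python
from collections import Counter

# Hypothetical raw events: one record per action, nothing pre-aggregated.
raw_events = [
    {"visitor": "john", "action": "login"},
    {"visitor": "john", "action": "pageview"},
    {"visitor": "mary", "action": "pageview"},
    {"visitor": "mary", "action": "pageview"},
    {"visitor": "sara", "action": "login"},
]

# A rollup is cheap to derive from raw data at any time...
pageviews_per_visitor = Counter(
    e["visitor"] for e in raw_events if e["action"] == "pageview"
)

# ...but the reverse is impossible: only raw data can answer a question
# you didn't anticipate, like "which visitors never logged in?"
logged_in = {e["visitor"] for e in raw_events if e["action"] == "login"}
never_logged_in = {e["visitor"] for e in raw_events} - logged_in
```

If you had stored only the page-view counts, the “who never logs in?” question would be unanswerable, no matter how much engineering you throw at it later.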
Lesson #4: Without enrichments, it’s hard to derive insights
Studying what users do once they’re on your site is great, but if you don’t analyze visitor dimensions—such as the devices they’re using, their geographic regions, and traffic source categories—you can’t cleanly draw useful conclusions.
These are each examples of “enrichments,” and a good hosted data pipeline will perform these automatically for you. A fire-and-forget logging system without built-in smarts will leave this kind of analysis to you, thus wasting more of your engineering organization’s time.
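To make the idea concrete, here is a toy sketch of an enrichment step. The lookups are deliberately crude stand-ins; a real pipeline would use a proper User-Agent parser, a GeoIP database, and real referrer classification:

```python
def enrich(event):
    """Attach derived dimensions to a raw event.

    Toy logic for illustration only -- production enrichment relies on
    real UA-parsing and GeoIP libraries, not substring checks.
    """
    ua = event.get("user_agent", "")
    referrer = event.get("referrer", "")

    enriched = dict(event)
    enriched["device"] = "mobile" if "Mobile" in ua else "desktop"
    if "google." in referrer:
        enriched["traffic_source"] = "search"
    elif referrer:
        enriched["traffic_source"] = "referral"
    else:
        enriched["traffic_source"] = "direct"
    return enriched
```

The point is where this runs: in the pipeline, once, at capture time—so every downstream consumer gets the dimensions for free instead of re-deriving them in every query.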
Lesson #5: Once a data pipeline becomes a source of truth, reliability matters
My best advice for ensuring reliability is to set up alerting and monitoring systems so you know about mishaps the instant they occur and can fix them promptly.
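Even a simple freshness check catches a surprising share of pipeline failures. Here is a minimal sketch of one; the threshold is illustrative and should be tuned to your traffic patterns:

```python
import time

def check_pipeline_freshness(last_event_ts, max_lag_seconds=300, now=None):
    """Return an alert message if the pipeline looks stalled, else None.

    A crude but effective monitor: if no event has arrived within the
    allowed lag, something upstream is probably broken.
    """
    now = time.time() if now is None else now
    lag = now - last_event_ts
    if lag > max_lag_seconds:
        return f"ALERT: no events for {lag:.0f}s (threshold {max_lag_seconds}s)"
    return None
```

Run a check like this on a schedule and wire the alert into whatever paging system you use, so a stalled pipeline pages a human before a customer notices.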
Rob Story, a data engineer at Simple Finance, explained that he realized how important his infrastructure had become when it went down during an Amazon Redshift maintenance window, inadvertently knocking out production services that relied on the data.
I heard a similar story from Samson Hu, the engineer who helped build the analytics data pipeline for 500px. Hu said the number-one problem he faced with his data pipeline buildout was ensuring reliable operation around the clock. Valuable data tends to become embedded into production services more quickly than you’ll realize. Of course, if you want to avoid this problem altogether, make uptime and reliability someone else’s problem through hosted offerings.
Final Lesson: Moving data in-house isn’t easy, but it pays off
Building a data pipeline isn’t an easy feat, but the payoff of owning your own data is huge. You can finally move your organization from mere reporting of top-line metrics to actual insight at the most granular level.
Whether you build your own pipeline or start evaluating hosted options, you can benefit from the above lessons we learned during our last few years of production experience.