I am a data platform engineer. Many budding data engineers love to hear stories about how efficiency was squeezed out of some system reading tons of data: clever techniques for processing data streams, hitting hard SLAs, handling 100+ GB tables in the data warehouse, streaming a million messages per day, etc., etc.
All of this is great, but if I read that the biggest problem in an organization that uses a lot of data is just on the tech side of throughput and compute, I envy them. I know then and there that my job would be infinitely easier with those problems than with some of the other problems that exist at “big-data scale.” So this post will focus on the people problems instead. Let me explain…
Defining “at scale”
First off, let’s be clear: data at ANY scale is hard. Repeat that to yourself right now. It. Is. Hard. Don’t get naïve and think that the people working with data in small companies aren’t heroes, because they absolutely can be. That said, WHAT is hard depends on how much data you process.
When you are a smaller company and the biggest amount of processing you do is generally on the scale of gigabytes, many of your problems are going to be in just figuring out the modelling itself since data is usually limited. You might even be cheating and pulling data directly from operational databases without any replicas (commonly seen as a faux pas in the data engineering community). This is where 90% or more of companies land and this is fine. At that scale you are more likely to be worried about how to get data to fit into a Postgres instance or similar (and if you are starting off by using typical databases analytically instead of jumping to the big engines like Redshift or Snowflake, I approve!).
When I personally say “at scale” I mean that we have tons of data just being ingested every day into our data lake. We also have processing on the order of many petabytes per day to update models in our data warehouses (yes, plural) as well as a multitude of stream processors to make sure we get all the relevant change data based on events (because at this scale we can’t do change capture using snapshots of databases efficiently). The current org I work at (which shall remain unnamed) processes on the order of just over 1 exabyte (10^18 bytes) of data per month. This isn’t even that much, but we still have all the problems of actual large-scale companies.
Every day we have analytics teams creating new data models and data science teams creating new features to train ML algorithms or fine-tune some LLMs we use in-house (because of course we do that, “AI” is everywhere now). We also have dashboards that get updated, as well as freshness and data validation monitors that run almost constantly. I won’t mention our full stack, but we have found issues in almost every existing tool for large-scale analytics.
I guess that’s a succinct way to put it. We are at a scale where technologies for processing data can just break and that’s still not the set of problems we spend most of our time on.
Data at scale is a hot mess
I would love it if just fixing tech at that scale were the full extent of our problems. But no, that’s the easier (and fun) part. The real problems hit much, much harder. Generally, if you ask your company, “Hey, what’s an order/sale?” you would expect a straightforward response. Sounds reasonable, right?
Let me paint the picture of how it is to work at many big-data orgs. Bob works in a financial department at a company that sells some products. Bob wants to analyze how much money the company is making. Bob naïvely starts by going to his department’s data warehouse and finds a dataset called “sales_2024.” He’s told the data is not fresh because their processing job is delayed due to prioritization in the data platform (damn, those guys suck, always governing things). That’s okay, though, since Bob doesn’t need super fresh data. He queries the table but realizes the data seems off. Stupid Bob! That’s the dataset for forecasting revenue; what you wanted was the table called “transactions_2024.” Fair enough, so Bob continues and looks for the table called “transactions_2024.” But what’s this? Bob can’t query the table! Unfortunately Bob doesn’t have access and needs to request it, because such things are sensitive, of course.
Bob sighs and reluctantly uses an interface from the 90s to request access, which his manager needs to approve, and of course it just happens someone is out sick so he needs to wait at least a day. A full day, and Bob isn’t any closer to his goal. Okay, but these things happen, and anyway it only takes a day to get his access, so he then looks at the dataset. “What the hell?” he says, as he realizes that the data stops at around last month. “What kind of fuckery is this!?” thinks Bob. Well it turns out that Bob missed the announcement that the actual revenue table was going to be moved, and that now he should be using the new dataset called “transactions_new_2024.” Stupid, stupid Bob.
Bob goes to query this new dataset, but yet again needs more access. You might think the access would be the same, but that would make sense, and sense is not the name of the game in corporate governance.
Annoyed, Bob reluctantly goes through the same painful 90s interface and eventually gets his access. Now that he has the data access, he is relieved that he can finally see the actuals. He looks again and realizes he should join this data to some sales categories so he can measure which areas of the company are performing best. “Should be easy, since the transactions can be linked to the products purchased.” Bob’s stupidity is so thick it could be considered molasses.
He should have known that the team owning the data model had removed categories, since those should be maintained by a separate team in charge of tracking order information. So Bob goes to the orders datasets and once again requests access, going through the whole process just to try querying things. But wait, there are seven different datasets with “orders” in the name. He asks why, and gets an answer that it took too long to add certain columns or something, so the teams just created a new table whenever they needed a new column. “Oh, screw it, I’ll just request access to all of them,” he thinks. While he waits for his access, he vents his frustration to Jan, who happens to tell Bob about the Data Catalog. WOW! A whole data catalog; this is what all the big orgs have that solves all confusion, right? He rushes to the catalog hoping to find out what else he needs access to so he can avoid any more guessing. He looks at the catalog for two minutes and despairs, however, reading such helpful descriptions as:
- “This is a dataset for our team’s operations”
- “Copy of dataset_x to add a column”
- “Add your dataset description here.”
- “-”
Regardless, Bob is a champ, and he decides “I’m gonna get this done.” He finally gets to use the new orders table, barely remembering what it was he was supposed to be doing (admit it, you got lost too). Unfortunately he finds out that the categories are incorrect. He goes to find out why, and it turns out “categories” means something TOTALLY DIFFERENT from product categories. He rushes to the catalog to see if he can at least find something that would help, but the description just reads “orders” and every column just has the name of the column as its description. Frustrated, he angrily asks in the general channel, “Why are the categories different in the orders tables? How do I get product categories!?”
Someone types up the response “we don’t get that data from the upstream service,” then someone chimes in “actually, that’s not the responsibility of our service to track,” while a third writes “have you tried looking into the actuals_2024 table?” Bob at this point has an aneurysm from the stress and is carted off. Meanwhile someone takes his place, produces a dashboard without categories based on the original “forecast” table (which is completely incorrect), and then gets a promotion.
Where does data platform engineering come in?
Why did I bring this situation up? Because “at scale,” everything you just read is what data platform engineers in a big organization with many teams spend the majority of their time on. It can feel like being a glorified janitor.
Sure, there is the odd data pipeline issue here and there, but on the engineering side we get pretty good at handling those when there are literally thousands of jobs running across an org and no one wants to stay up all night dealing with on-call issues. No, we spend most of our time trying to make sure no one has to experience what Bob did. Trying and usually failing, but trying nonetheless.
“Data Confusion” is the term I personally use to describe such a situation. It really only comes about when you have so much data coming into analytics systems that things are allowed to devolve into a mess without any control. I think there are a couple of major factors that contribute to this:
- Focusing on quantity, not quality
  - Data modelers and analysts are pressured to build generic solutions that prioritize data ingestion and pipelining rather than attacking governance and cataloging useful information for others to consume.
  - Because teams move fast, a lot of cleanup is simply not done; things get left over and never addressed, adding to the confusion about which datasets are actually useful while also costing storage.
- Lack of communication between teams
  - Sources of truth devolve, and systems that normally should coordinate start evolving in isolation.
  - A wall forms between analytics/data engineers and service maintainers, which means a lot of things get built in ways that completely shut out the other parties from discussion in order to keep velocity high.
I haven’t given these a lot of thought really, and surely there are other factors, but generally, yeah. In a big org without rules, people cheat to move fast, and it hurts everyone in the end; rather than talk about it, they’d rather keep doing what they’re doing and monkey-patch things into oblivion. A lot of blog posts will say “just create a single source of truth” or “use X,” but to be honest, I don’t think those are written honestly, because the reality is that no one in a large money-making enterprise is going to blindly accept having entire departments stop everything, get into a state of friendship, and rewrite half their flows on a dime. It takes a lot more effort than that, or a large consolidation of teams (psst, that means layoffs, which I hear people hate).
It’s kind of funny how, when you reach large scale, the problems are more around people and less technical. But this is a theme I am sure experts from other tech areas will recognize.
Accept that people will complain
When this mess happens, data platform teams need to pick up the slack. No one else will care that they are creating duplicate orders dataset number 100, and they will expect to get all the resources to do so. That is why there is usually some concept of prioritization or governance baked into the tools we build: at the end of the day, accommodating all the processing people want to do (sometimes without even a concrete reason for doing it) costs money, which, again, companies like to save.
Because of their centrality, platform teams are almost always the ones tasked with addressing the mess above. Basically, if in an exabyte-scale company the entire org full of analysts doesn’t hate you, you’re probably being too lax on some rule somewhere. Okay, that’s a bit extreme; generally people will not haaate you, but they will not appreciate the work you are doing when you push back on them creating yet another copy of a dataset, or send their data processing job to the back of the queue because a financial processing model is running to close the books (what, you didn’t know financial teams use analytics warehouses to do reconciliation? You poor, blissfully ignorant soul). Accept this professionally and move on.
You will also feel like a lawyer
Funnily enough, a large part of running a large data platform in an org is making sure everyone is also playing by the rules. Access controls like the ones Bob dealt with are part of it, but at the end of the day your job is to work with security and legal teams. This is to make sure that, regardless of the fact that dataset creators agree to own responsibility, you are doing what you can centrally to prevent damage in the event they fuck up. And they will fuck this up, whether it’s retaining data for too long or accidentally storing info they shouldn’t have (whoopsie).
This means you will sit in meetings where you have to go through a list of controls and show designs for how you account for things like data leakage, or how you make sure customers’ data is removed in a timely manner when requested. As a data engineer I am compelled to make sure I know what I need to do to protect your rights and to implement that protection in my platform.
The last bastion
Oftentimes the data platform teams are expected to really cut down on waste and make sure the data you maintain is used effectively and people are not just playing around. Maybe it’s someone running queries that are too heavy or spamming the warehouse with 1000 queries per minute, maybe it’s someone creating too many training pipelines at once, or maybe someone flooded a Kafka cluster with messages. Regardless, you may need to address these as issues arise: find the team responsible, learn why they were doing what they were doing, lock down the behavior that was causing trouble, and build alternatives or new rules to prevent the bad stuff from happening again. Honestly it can feel like being a guardian of unruly children at times, but it is important to address issues head on.
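To give a flavor of what that detective work can look like, here is a purely illustrative sketch (the log format, field names, and the 1,000-queries-per-minute threshold are assumptions, not anything from a specific warehouse) of flagging users who spam the warehouse:

```python
from collections import Counter
from datetime import datetime

# Hypothetical query-log entries: (user, ISO timestamp). In reality you would
# pull these from your warehouse's query history / audit log.
query_log = [
    ("analyst_bob", "2024-05-01T09:00:01"),
    ("analyst_bob", "2024-05-01T09:00:02"),
    ("ml_pipeline_7", "2024-05-01T09:00:03"),
    # ... thousands more entries
]

QUERIES_PER_MINUTE_LIMIT = 1000  # assumed threshold; tune to your platform

def spammy_users(log, limit=QUERIES_PER_MINUTE_LIMIT):
    """Return {user: worst_minute_count} for users exceeding the per-minute limit."""
    per_minute = Counter()
    for user, ts in log:
        minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")
        per_minute[(user, minute)] += 1
    worst = {}
    for (user, _minute), count in per_minute.items():
        if count > limit:
            worst[user] = max(worst.get(user, 0), count)
    return worst

# With the toy log above nothing trips the limit; against a real query history
# this is where the conversation with the offending team starts.
print(spammy_users(query_log))
```

The hard part isn’t the script; it’s the follow-up conversation and the alternative you offer the team afterwards.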
This sounds like it sucks!
It does, but it’s a really fun problem space to own, too. Especially if you provide a data mesh for the company by enabling teams to spin up their own resources, etc., because now you get to have fun managing things across several different accounts if you’re in the cloud, or several datacenters if you run your own. And it is pretty cool to sit back sometimes and watch someone click a button and suddenly have their own data processing stack, complete with all the controls you need them to have, while still being able to share that data widely back to the rest of the company.
So yeah, things suck sometimes, but it’s still a cool gig. There are also things you can push for earlier to try to prevent issues before they get bigger.
Inventory is key
If you want to make sure your company is ready for truly big data, I would say everyone should get a data inventory made right now. Go now and use one of the MANY tools out there for this. Want to use something from the big brains at LinkedIn? https://datahubproject.io/ is a good catalog to start using. Don’t want to use that and instead want something else? Try out https://amundsen.io/. Or go build your own if you want. Just please use something and make sure that datasets in use have at minimum a description that shows what it is designed to do. If you rely on dataset names and maybe just a single sentence description, you’re gonna have a bad time.
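If you go the DataHub route, pushing a real description onto a dataset programmatically looks roughly like the sketch below. It is based on DataHub’s Python REST emitter; the server URL, platform, and dataset name are placeholders, and the SDK evolves, so treat this as a sketch and check the current docs rather than copying it blindly:

```python
# Rough sketch of attaching a description to a dataset via DataHub's REST emitter.
# Assumes the acryl-datahub package and a DataHub GMS reachable at the given URL.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# An actual description, not "Add your dataset description here."
properties = DatasetPropertiesClass(
    description=(
        "Settled customer transactions for 2024, one row per payment, "
        "updated daily from the payments service. Owned by team-finance-data."
    ),
)

event = MetadataChangeProposalWrapper(
    entityUrn=builder.make_dataset_urn(platform="snowflake",
                                       name="finance.transactions_2024"),
    aspect=properties,
)
emitter.emit(event)  # blocking call; raises if the server rejects it
```

Amundsen or a home-grown catalog can express the same thing; the point is that the description is written by a human who actually knows what the dataset is for.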
You will also want to make sure each dataset is tracked with at least the following information (a rough sketch of such an inventory record follows the list):
- The team/contact responsible for the dataset
- Query frequency
- Update frequency
- Size/cost
- Dependencies (can be other datasets, services, etc.)
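You don’t need anything fancy to start. Here is a minimal sketch of what one inventory record could look like; the field names and every value are invented for illustration, so map them to whatever your catalog actually stores:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetInventoryEntry:
    """One row of a dataset inventory; every field maps to the list above."""
    name: str
    owner_team: str              # who to contact / page about it
    description: str             # what it is for, in plain words
    queries_per_day: float       # query frequency, from warehouse query logs
    updates_per_day: float       # update frequency, from the pipeline scheduler
    size_gb: float               # size on disk, a proxy for storage cost
    monthly_cost_usd: float      # compute + storage, if you can attribute it
    dependencies: list[str] = field(default_factory=list)  # upstream datasets/services

# Example entry (all values invented)
transactions = DatasetInventoryEntry(
    name="finance.transactions_2024",
    owner_team="finance-data",
    description="Settled customer transactions for 2024, one row per payment.",
    queries_per_day=340,
    updates_per_day=24,
    size_gb=5200,
    monthly_cost_usd=1900,
    dependencies=["payments-service.events", "finance.fx_rates"],
)
```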
The reason to track all of this is that, ultimately, any dataset has a finite lifetime, and when it comes time for your company’s cost cutting, you want to be confident you can pick out the low-hanging fruit to cull, or at least identify which datasets are less valuable. You can automate this, but the criteria you use will change depending on which phase of growth vs. cutback your org is going through. Best be ready for it.
Bonus points if you build out categories for your data as well! Having column info labeled (e.g. financial data, personal data, etc.) is also a game changer if your company suddenly needs to track which types of data users have access to.
Communication is also key
Data design in the analytics areas starts from the source, meaning the services that serve your customers. All designs around data changes should be made with analytics use cases in mind as well. This sounds hard to do, but luckily there is a concept that helps, called data contracts, which turns the needs of analytics into a testable, verifiable document that services can incorporate into their testing flows for any data model or message schema change. Honestly, though, as long as teams talk to each other, mission accomplished. Don’t overthink it.
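To make that concrete without naming any particular data contract framework, here is a deliberately tiny sketch; the contract contents, the proposed schema, and the check itself are all invented for illustration:

```python
# A toy "data contract": the fields analytics depends on and their expected types.
ORDERS_CONTRACT = {
    "order_id": "string",
    "customer_id": "string",
    "amount": "decimal",
    "currency": "string",
    "product_category": "string",   # the field Bob never got
    "created_at": "timestamp",
}

def contract_violations(schema: dict, contract: dict) -> list[str]:
    """Compare a service's published schema against the analytics contract."""
    problems = []
    for field_name, expected_type in contract.items():
        if field_name not in schema:
            problems.append(f"missing field: {field_name}")
        elif schema[field_name] != expected_type:
            problems.append(
                f"type change on {field_name}: {schema[field_name]} != {expected_type}"
            )
    return problems

# The service runs this in CI whenever its event schema changes.
proposed_schema = {
    "order_id": "string",
    "customer_id": "string",
    "amount": "decimal",
    "currency": "string",
    "created_at": "timestamp",
    # product_category silently dropped
}
violations = contract_violations(proposed_schema, ORDERS_CONTRACT)
if violations:
    # In CI this would fail the build and start a conversation with the
    # analytics team instead of silently breaking their models.
    print("Contract violations:", violations)
```

The interesting part is organizational, not technical: the service team agrees to run the check, and the analytics team agrees to keep the contract small and honest.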
I also can’t stress enough the importance of getting service owners involved. Most of the time they are fine with the idea of taking analytics data into account; they just didn’t know their data was being used. Likewise, it is important that you build a clear picture for analysts and data modelers of where their data comes from. Surprisingly, this is not well known across all teams in bigger orgs, which basically means the risk of building duplicate datasets is much higher. Having the inventory mentioned before really helps.
A stick and guidance is better than a carrot
I think this has to be said because I have seen it too many times: someone comes in with honest intentions and a good heart, trying to clean things up. Being nice and lenient and letting things slide does not work when it comes to running an exabyte-scale platform, and it never will. You have a job to do, the org will expect you to do it, and you should extend that same expectation to others who want to use your platform. If they want to play, they need to follow the rules. Obviously you will get people who whine and threaten the dreaded “escalation,” but this is a ruse. That escalation will ultimately help you more than hurt you, because any reasonable senior manager will see your function as more important than some analyst or modeling team wanting to abuse the resources the entire org relies on. Just remember: the cost of the actual tech is one thing. The cost of confusion brings much more risk.
Cut access when rules are broken. Dataset isn’t properly documented? Guess who doesn’t get to publish. Sound harsh? No, there is actual money, and the risk of misguiding the company, on the line. Follow this up by learning why the team did what they did and showing them how to do it right, i.e. fix any gaps in documentation and build out some helpful guides to prevent further issues. Offer trainings if you must. And of course be professional about this. Life is stressful at a large org as it is. You’re not here to power trip, so remember not to overplay your hand just for the sake of being right. Everyone is there to make the company grow effectively and win.
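As one example of what “guess who doesn’t get to publish” can mean mechanically, here is a hypothetical pre-publish check; the placeholder descriptions come straight from Bob’s catalog adventure above, and the exact rules are assumptions you would tune to your own platform:

```python
# Descriptions that should never make it into the catalog (see Bob's findings above).
PLACEHOLDER_DESCRIPTIONS = {
    "",
    "-",
    "add your dataset description here.",
    "this is a dataset for our team's operations",
}

def may_publish(metadata: dict) -> tuple[bool, list[str]]:
    """Gate dataset publication on minimal documentation and ownership."""
    reasons = []
    description = metadata.get("description", "").strip().lower()
    if description in PLACEHOLDER_DESCRIPTIONS or len(description) < 30:
        reasons.append("description is missing or a placeholder")
    if not metadata.get("owner_team"):
        reasons.append("no owning team/contact set")
    if not metadata.get("dependencies"):
        reasons.append("no upstream dependencies declared")
    return (not reasons, reasons)

ok, reasons = may_publish({"description": "Copy of dataset_x to add a column"})
print(ok, reasons)  # False, plus the list of things the team needs to fix
```

Pair a rejection like this with a link to a short guide on writing useful descriptions, and most teams fix it without escalating anything.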
Is that it?
No, definitely not! I am sure I will make posts specifically about the tech parts later, covering things like breaking data warehouses, finding flaws in Iceberg, or even overloading AWS account data planes to the point where nothing in them works. But for now I think this is a good start for those who read the blog posts that, in my opinion, focus too much on the happy tech side of things, where it’s all about the challenge of fixing things. To be honest, that is not the majority of the story to me. The majority is this lovely mess of chaotic data use that we are stuck perpetually cleaning up.
Data is a hot mess and I absolutely love being in the center of all of it.