Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

(adamdrake.com)

100 points | by tosh 5 hours ago

18 comments

MarginalGainz 2 hours ago
The saddest part about this article being from 2014 is that the situation has arguably gotten worse.
We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
[-]
- pocketarc 1 hour ago
  I agree - and it's not just what gets you promoted, but also what gets you hired, and what people look for in general.
  You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future.
  Nobody's against it. So you end up in that situation, which even a basic home desktop would be more than capable of handling.
  [-]
  - jrjeksjd8d 1 hour ago
    I have been the first (and only) DevOps person at a couple startups. I'm usually pretty guilty of NIH and wanting to develop in-house tooling to improve productivity. But more and more in my career I try to make boring choices.
    Cost is usually not a huge problem beyond seed stage. Series A-B the biggest problem is growing the customer base so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC.
    Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rats nest fiefdom of my own design.
    The fact is that if 5k/month infra cost for a core part of the service sinks your VC backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you.
    [-]
    - woooooo 58 minutes ago
      The issue is when all the spending gets you is more complexity, maintenance, and you don't even get a performance benefit.
      I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman.
- shiandow 5 minutes ago
  If airflow is a layer of abstraction something is wrong.
  Yes it is an additional layer, but if your orchestration starts concerning itself with what it is doing then something is wrong. It is not a layer on top of other logic, it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them.
  If you don't insist on doing heavy compitations within the airflow worker it is dirt cheap. If it's something that can easily be done in bash or python you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it.
- wccrawford 1 hour ago
  I've spent my last 2 decades doing what's right, using the technologies that make sense instead of the techs that are cool on my resume.
  And then I got laid off. Now, I've got very few modern frameworks on my resume and I've been jobless for over a year.
  I'm feeling a right fool now.
  [-]
  - fHr 23 minutes ago
    This exactly, actual doers are most of the time not rewarded meanwhile the AWS senior sucking Jeffs wiener specialist gets a job doing nothing but generating costs and leave behind more shit after his 3 years moving the ladder up to some even bigger bs pretend consulting job at an even bigger company. It's the same bs mostly for developers. I rewrite their library from TS to Rust and it gains them 50x performance increases and saves them 5k+ a week over all their compute now but nobody gives a shit and I do not have a certification for that to show off on my LinkedIn. Meanwhile my PM did nothing got paid to do some shity certificate and then gets the credit and the certificate and pisses of to the next bigger fish collecting another 100k more meanwhile I get a 1k bonus and a pat on the shoulder. Corporate late stage capitalism is complete fucking bs and I think about becoming a PM as well now. I feel like a fool and betrayed. Meanwhile they constantly threaten our Team to lay it off or outsource it as they say we are to expensive in a first world country and they easily find as good people in India etc. What a time to be alive.
- nicoburns 1 hour ago
  > datasets that often fit entirely in RAM.
  Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good.
  [-]
  - dapperdrake 27 minutes ago
    Oldie-but-goldy:
    https://yourdatafitsinram.net/
- willtemperley 35 minutes ago
  Yep. The cloud providers however always get paid, and get paid twice on Sunday when the dev-admins forget to turn stuff off.
  It’s the same story as always, just it used to be Oracle certified tech, now it’s the AWS tech certified to ensure you pay Amazon.
- reval 2 hours ago
  I’ve seen this pattern play out before. The pushback on simpler alternatives seems from a legitimate need for short time to market from the demand some of the equation and a lack of knowledge on the supply side. Every time I hear an engineer call something hacky, they are at the edge of their abilities.
  [-]
  - networkadmin 1 hour ago
    > Every time I hear an engineer call something hacky, they are at the edge of their abilities.
    It's just like the systemd people talking about sysvinit. "Eww, shell scripts! What a terrible hack!" says the guy with no clue and no skills.
    It's like the whole ship is being steered by noobs.
    [-]
    - acdha 1 hour ago
      systemd would be a derail even if you weren’t misrepresenting the situation at several levels. Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws, whereas in this case it’s a different scenario where the extra functionality is both unneeded and making it worse at the core task.
      [-]
      - networkadmin 1 hour ago
        > Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws
        That's funny. I used to have to clean up the messes caused by systemd's design limitations and flaws, until I built my own distro with a sane init system installed.
        Many of the noobs groaning about the indignity of shell scripts don't even realize that they could write init 'scripts' in whatever language they want, including Python (the language these types usually love so much, if they do any programming at all.)
    - dapperdrake 25 minutes ago
      Eternal September
- RobinL 2 hours ago
  Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads I imagine it's often faster than command line tool, and with easier to understand syntax
  [-]
  - briHass 18 minutes ago
    More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.
    I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (read data into objects in memory) is much more challenging to view the data. Debugging tools to inspect the objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
  - mrgoldenbrown 28 minutes ago
    IMHO the main point of the article is that typical unix command pipeline pipeline IS parallelized already.
    The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with.
- lormayna 1 hour ago
  For a dasaset that live in RAM, the best solution are DuckDB or clickhouse-local. Using SQLish data is easier than a bunch of bash script and really powerful.
  [-]
  - zX41ZdbW 38 minutes ago
    Though ClickHouse is not limited to a single machine or local data processing. It's a full-featured distributed database.
- 1vuio0pswjnm7 11 minutes ago
  "I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'."
  Also seen strange responses from HN commenters when it's mentioned that 1. (a) bash is large and slow compared to (b) ash and 2. (a) is better suited for use as an interactive shell while (b) is better suited for use as a non-interactive shell, i.e., a scripting shell
- petcat 2 hours ago
  > a robust bash script
  These hardly exist in practice.
  But I get what you mean.
benrutter 1 hour ago
This times a zillion! I think there's been a huge industry push to convince managers and more junior engineers that spark and distributed tools are the correct way to do data engineering.
I think its a similar pattern to web dev influencers have convinced everyone to build huge hydrated-spa-framework-craziness where a static site would do.
My advice to get out of this mess:
- Managers, don't ask for specific solutions (spark, react). Ask for clever engineers to solve problems and optimise / track what you vare about (cost, performance etc). You hired them to know best, and they probably do.
- Technical leads, if your manager is saying "what about hyperscale?" You don't have to say "our existing solution will scale forever". It's fine to say, "our pipelines handle dataset up to 20GB, we don't expect to see anything larger soon, and if we do we'll do x/y/z to meet that scale". Your manager probably just wants to know scaling isn't going to crash everything, not that you've optimised the hell out of everything for your excel spreadsheet processing pipeline.
[-]
- zug_zug 5 minutes ago
  Absolutely, when I worked at (semi-well-known unicorn) a half-dozen years ago on the data-engineering team the manager told me "Hey we want to use spark next quarter, that's a huge initiative."
  And I immediately asked, "in what capacity?" And the answer was don't-know/doesn't-matter, it's just important that we can say we're using it. I really wish I understood where that was coming from (his manager resume-building? somebody getting a kickback?)
torginus 3 hours ago
When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at 10s of MB/s - creating a huge bottleneck.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need when these sort of clustered approaches really start to make sense?
[-]
- embedding-shape 2 hours ago
  > I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON
  Hah, incredibly funny, I remember doing the complete opposite about 15 years ago, some beginner developer had setup a whole interconnected system with multiple processes and what not in order to process a bunch of JSON and it took forever. Got replaced with a bash script + Python!
  > Just how much data do you need when these sort of clustered approaches really start to make sense?
  I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
- rented_mule 2 hours ago
  I like the peer comment's answer about a processing time threshold (e.g., a day). Another obvious threshold is data that doesn't conveniently fit on local disks. Large scale processing solutions can often process directly from/to object stores like S3. And if it's running inside the same provider (e.g., AWS in the case of S3), data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.
- noufalibrahim 1 hour ago
  I remember a panel once at a PyCon where we were discussing, I think, the anaconda distribution in the context of packaging and a respected data scientist (whose talks have always been hugely popular) made the point that he doesn't like Pandas because it's not excel. The latter was his go to tool for most of his exploratory work. If the data were too big, he'd sample it and things like that but his work finally was in Excel.
  Quick Python/bash to cleanup data is fine too I suppose and with LLMs, it's easier than ever to write the quick throwaway script.
  [-]
  - dapperdrake 21 minutes ago
    Whenever I had to use anaconda it was slow as molasses. Was that ever fixed?
- dapperdrake 22 minutes ago
  Adam Drake's example (OP) also streams from disk. And the unix pipeline is task-parallel.
- KolmogorovComp 2 hours ago
  > Just how much data do you need when these sort of clustered approaches really start to make sense?
  I did not see your comment earlier, but to stay with Chess see https://news.ycombinator.com/item?id=46667287, with ~14Tb uncompressed.
  It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.
- zjaffee 2 hours ago
  It's not about how much data you have, but also the sorts of things you are running on your data. Joins and group by's scale much faster than any aggregation. Additionally, you have a unified platform where large teams can share code in a structured way for all data processing jobs. It's similar in how companies use k8s as a way to manage the human side of software development in that sense.
  I can however say that when I had a job at a major cloud provider optimizing spark core for our customers, one of the key areas where we saw rapid improvement was simply through fewer machines with vertically scaled hardware almost always outperformed any sort of distributed system (abet not always from a price performance perspective).
  The real value often comes from the ability to do retries, and leverage left over underutilized hardware (i.e. spot instances, or in your own data center at times when scale is lower), handle hardware failures, ect, all with the ability for the full above suite of tools to work.
  [-]
  - dapperdrake 20 minutes ago
    Other way around. Aggregation is usually faster than a join.
- commandersaki 2 hours ago
  How do you stream parse json? I thought you need to ingest it whole to ensure it is syntactically valid, and most parsers don't work with inchoate or invalid json? Or at least it doesn't seem trivial.
  [-]
  - torginus 1 hour ago
    I used Newtonsoft.Json which takes in a stream, and while it can give you objects, it can also expose it as a stream of tokens.
    The bulk of the data was in big JSON arrays, so you basically consumed the array start token, then used the parser to consume an entire objects which could be turned into a C# object by the deserializer, then you consumed a comma or end array token until you ran out of tokens.
    I had to do it like this because DS-es were running into the problem that some of the files didn't fit into memory. The previous approach took 1 hour, involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB memory wasnt enough and the system started swapping).
    It wasn't fast even before swapping (I learned just how slow Python can be), but then basically it took a day to run a single experiment. Then the data got turned into a dataframe.
    I replaced that part of the Python code processing and outputted a CSV which Pandas could read without having to trip through Python code (I guess it has an internal optimized C implementation).
    The preprocessor was able to run on the build machines and DSes consumed the CSV directly.
    [-]
    - briHass 6 minutes ago
      This sounds similar to how in C#/.NET there are (at least) 3 methods to reading XML: XmlDocument, XPathDocument, or XmlReader. The first 2 are in-memory object models that must parse the entire document to build up an object hierarchy, which you then access object-oriented representations of XML constructs like elements and attributes. The XmlReader is stream-based, where you handle tokens in the XML as they are read (forward-only.)
      Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET built-in parsing) has a similar token-based reader in addition to the standard (de)serialization to objects approach.
  - rented_mule 2 hours ago
    I don't know what the GP was referring too, but often this is about "JSONL" / "JSON Lines" - files containing one JSON object per line. This is common for things like log files. So, process the data as each line is deserialized rather than deserializing the entire file first.
  - giovannibonetti 2 hours ago
    You assume it is valid, until it isn't and you can have different strategies to handle that, like just skipping the broken part and carrying on.
    Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser is eager to spit out a stream of data that matches the query as soon as you find it.
    The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
  - shakna 2 hours ago
    There's a whole heap of approaches, each with their own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid json.
    You can buffer data, or yield as it becomes available before discarding, or use the visitor pattern, and others.
    One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
rented_mule 2 hours ago
A little bit of history related to the article for any who might be interested...
mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
paranoidrobot 3 hours ago
A selection of times it's been previously posted:
(2018, 222 comments) https://news.ycombinator.com/item?id=17135841
(2022, 166 comments) https://news.ycombinator.com/item?id=30595026
(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
forinti 1 hour ago
I think many devs learn the trade with Windows and don't get exposure to these tools.
Plus, they require a bit of reading because they operate on a higher level of abstraction than loops and ifs. You get implicit loops, your fields get cut up automatically, and you can apply regexes simultaneously on all fields. So it's not obvious to the untrained eye.
But you get a lot of power and flexibility on the cli, which enable you to rapidly put together an ad hoc solution which can get the job done or at least serve as a baseline before you reach for the big guns.
fifilura 12 minutes ago
No joins in that article?
The comments here smell of "real engineers use command line". But I am not sure they ever actually worked with analysing data more than using it as a log parser.
Yes Hadoop is 2014.
These days you obviously don't set up a Hadoop cluster. You use the cloud provider service provided (BigQuery or AWS Athena for example).
Or map your data into DuckDB or use polars if it is small.
phyzix5761 42 minutes ago
It’s easy to overlook how often straightforward approaches are the best fit when the data and problem are well understood. Large expensive tools can become problems in their own right creating complexity that then requires even more tooling to manage. (Maybe that's the intent?) The issue is that teams and companies often adopt optimization frameworks earlier than necessary. Starting with simpler tools can get you most of the way there and in many cases they turn out to be all that’s needed.
KolmogorovComp 2 hours ago
> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open-data for chess must comes from Lichess https://database.lichess.org, with ~7B games and 2.34 TB compressed, ~14TB uncompressed.
Would Hadoop win here?
[-]
- woooooo 49 minutes ago
  If you get all the data on fast SSDs in a single chassis, you probably still beat EMR over S3. But then you have a whole dedicated server to manage your 14TB of chess games.
  The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.
- dapperdrake 17 minutes ago
  Probably not.
  The compressed data can fit onto a local SSD. Decompression can definitely be streamed.
jonathanhefner 29 minutes ago
And since AI agents are extremely good at using them, command-line tools are also probably 235x more effective for your data science needs.
fmajid 2 hours ago
I've contributed to PrestoDB, but the availability of DuckDB and fast multi core machines with even faster SSDs makes the need for distribution all the more niche, or even cargo-culting Google or Meta.
ejoebstl 3 hours ago
Great article. Hadoop (and other similar tools) are for datasets so huge they don't fit on one machine.
[-]
- vjerancrnjak 3 hours ago
  https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...
  I like this one where they put a dataset on 80 machines only then for someone to put the same dataset on 1 Intel NUC and outperform in query time.
  https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...
  Datasets never become big enough…
  [-]
  - saberience 2 hours ago
    Well, at my old company we had some datasets in the 6-8 PB range, so tell me how we would run analytics on that dataset on an Intel NUC.
    Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."
    [-]
    - dapperdrake 15 minutes ago
      These situations are rare not difficult.
      The solutions are well known even to many non-programmers who actually have that problem:
      There are also sensor arrays that write 100,000 data points per millisecond. But again, that is a hardware problem not a software problem.
  - literalAardvark 3 hours ago
    Well yeah, but that's a _very_ different engineering decision with different constraints, it's not fully apples to apples.
    Having materialised views increases insert load for every view, so if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare, say, find all devices with a specific model and year+month because there's a dodgy lot, you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.
  - DetroitThrow 2 hours ago
    >Datasets never become big enough…
    Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single instance ClickHouse often run into if they actually have a large dataset.
    Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.
    [-]
    - zX41ZdbW 40 minutes ago
      When a single server is not enough, you deploy ClickHouse on a cluster, up to thousands of machines, e.g., https://clickhouse.com/blog/how-clickhouse-powers-ahrefs-the...
- PunchyHamster 3 hours ago
  And we can have pretty fucking big single machines right now
  [-]
  - dapperdrake 14 minutes ago
    https://yourdatafitsinram.net/
EdwardCoffin 1 hour ago
This makes me think of Bane's rule, described in this comment here [1]:
Bane's rule, you don't understand a distributed computing problem until you can get it to fit on a single machine first.
[1] https://news.ycombinator.com/item?id=8902739
rcarmo 3 hours ago
This has been a recurring theme for ages, with a few companies taking it to extremes—there are people transpiring COBOL to bash too…
nasretdinov 3 hours ago
And now with things like DuckDB and clickhouse-local you won't have to worry about data processing performance ever again. Just kidding, but especially with ClickHouse it's so much better to handle the large data volume compared to the past, and even a single beefy server is often enough to satisfy all data analytics needs for a moderate-to-large company.
killingtime74 3 hours ago
Hadoop, blast from the past
cryptoboy2283 1 hour ago
Earlier in 2010 - http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...