For a bit more context, let’s say we’re doing daily batch processing on no more than 10 GB of data. I’ve seen companies reach for Spark in this scenario, which boggles my mind. What’s the likely reasoning here?
If the incoming data were much larger, say 1 TB/day, would Spark generally be the preferred solution?
How much does access to the streaming and ML packages factor into the decision to use Spark? If a team also wants to do ML, I’m guessing having that capability available on top of the same data set is pretty convenient, so Spark would be the natural choice?
To be honest, as someone still learning Spark’s inner workings and coming from a traditional ETL developer background, I just don’t understand the hype around it.