I understand that Spark is good for streaming and ML. But when it comes to plain ETL, why should I use Spark over a typical ETL tool/solution, or over using a database to process the data?

For a bit more context: let's say we're doing batch processing every day on no more than GB of data. I've seen companies reach for Spark in this scenario, which boggles my mind. What's the likely reasoning here?

If the incoming data were much larger, say 1 TB/day, would Spark generally be the preferred solution?

How much does the ability to use the streaming and ML packages factor into decisions to use Spark? If a team also wants to do ML, I'm guessing that having that capability on top of the same data set is pretty convenient, so Spark is ideal?

To be honest, as someone still learning its inner workings and coming from a traditional ETL developer background, I just don’t understand the hype around Spark.
