I am trying to build an infrastructure (preferably open source) to solve a big data problem,
so I am looking at options based on Presto.
I have 4 large fact tables:
1) 2 snapshot facts: 40 attributes and approx. 1 TB in size.
Partition key: PRODUCT_ID
Composite sort key: (Date, Country)
2) 2 transaction facts: 200 attributes each and approx. 50 GB in size.
Partition key: PRODUCT_ID
Sort key: (Date)
Requirements to meet:
1) Volume: one snapshot fact has to be joined with the 2 transaction facts and 3-4 dimensions to generate aggregated time-series metrics at different hierarchy levels.
2) Velocity: results in under 1 minute, 2 minutes at most.
3) High concurrency: 50 users querying in parallel from Tableau (max users: 200).
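To put rough numbers on these requirements, here is a back-of-envelope sketch in Python. It is only arithmetic on the sizes above, not a benchmark of either system. It assumes the quoted sizes are per table, and the `scanned_fraction` value (how much data is left to read after partition pruning and column selection) is a made-up placeholder that would have to be measured on real queries.

```python
# Back-of-envelope sizing. Table sizes come from the question (assumed to be
# per table); scanned_fraction is a hypothetical placeholder, not a measurement.

def scan_gb_per_second(snapshot_gb, transaction_gb, n_transaction,
                       deadline_s, scanned_fraction=1.0):
    """Read rate (GB/s) one query needs to finish within deadline_s."""
    total_gb = snapshot_gb + n_transaction * transaction_gb
    return total_gb * scanned_fraction / deadline_s


def cluster_gb_per_second(per_query_rate, concurrent_queries):
    """Aggregate read rate if all concurrent queries run at the same time."""
    return per_query_rate * concurrent_queries


# Worst case: one 1 TB snapshot fact joined with two 50 GB transaction facts,
# fully scanned, with a 60 second deadline.
full_scan = scan_gb_per_second(1024, 50, 2, deadline_s=60)

# Hypothetical case: only 5% of the data is read after pruning.
pruned = scan_gb_per_second(1024, 50, 2, deadline_s=60, scanned_fraction=0.05)

print(f"single query, full scan: {full_scan:.2f} GB/s")
print(f"single query, 5% scanned: {pruned:.2f} GB/s")
print(f"50 concurrent, 5% scanned: {cluster_gb_per_second(pruned, 50):.2f} GB/s")
```

The point of the sketch is that the single-query deadline and the 50-user concurrency multiply, so how much data each query can skip matters at least as much as the engine I pick.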
I have only a limited yearly budget for infrastructure and software.
I can either build a Presto-Hive-S3/HDFS stack on EMR or buy an Exasol (2 TB) cluster,
but I am not sure which one is better.