

TL;DR (full query below):


The table has 1,000 million (1 billion) rows, all integer columns; we can use an `IN` clause on 9 of the columns and a `BETWEEN` clause on the remaining `unix_time_column`.

We need to `GROUP BY` a selection of the `IN` columns (e.g. column_1 and column_2).

We also need to `GROUP BY` the `unix_time_column`. For instance: `custom_format_unix_time(946758896000, 'daily')` // returns: 2000-01-01

// we could group monthly, daily, and in other time intervals, which the custom function takes care of formatting
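
For reference, this is roughly how that bucketing could be written in plain SQL; a minimal sketch assuming PostgreSQL and `unix_time_column` stored in milliseconds (`custom_format_unix_time` is our own helper, not a built-in):

```sql
-- Sketch only: assumes PostgreSQL and unix_time_column in milliseconds.
-- Daily bucket, equivalent to custom_format_unix_time(..., 'daily'):
SELECT to_char(to_timestamp(unix_time_column / 1000), 'YYYY-MM-DD') AS x_axis_days
FROM my_table;

-- Arbitrary fixed intervals work with integer division,
-- e.g. 5-minute buckets (300 seconds):
SELECT to_timestamp((unix_time_column / 1000 / 300) * 300) AS bucket_5_min
FROM my_table;
```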

Then we use `COUNT(*)` on the grouped results, which looks something like this:

```
+-------------+---------------------------------+---------------+
| x_axis_days | y_axis                          | num_of_events |
+-------------+---------------------------------+---------------+
| 2000-01-01  | column-1-value – column-1-value | 3423          |
| 2000-01-01  | column-1-value – column-2-value | 1533          |
| 2000-01-02  | column-2-value – column-1-value | 9323          |
| 2000-01-02  | column-2-value – column-2-value | 3229          |
+-------------+---------------------------------+---------------+
```

We are not sure whether this is something MySQL/PostgreSQL should be able to handle, or whether we need something more powerful like Presto/Hive.

What technology would you use?

---

We will have a table with 1,000 million rows.

This table has 10 columns, all integers (keys).

We need to be able to handle this query, which basically creates a `heatmap` chart that we show to the user (basically a table with an x_axis, y_axis and values).

The query groups by day (which comes from one of the columns, the integer `unix_time_column`) and by a selection of the other columns (for instance column_1 and column_2).

An example query looks like this:

```sql
SELECT
    count(*) AS num_of_results,
    /* 1) we group daily, this returns something like: '2000-01-01' */
    format_time(unix_time_column, '%Y-%m-%d') AS x_axis_days,
    /* 2) we also group by any columns */
    concat(column_1, ' – ', column_2) AS y_axis
FROM
    my_table
WHERE
    column_5 IN (1, 2, 3)
    AND column_8 IN (4)
    /* we could have an AND for all columns or just one/some… usually it will be for around 3-4 columns */
    AND column_9 IN (7, 8)
    /* here we do the BETWEEN on 2 dates previously transformed to unix_time for performance reasons */
    AND unix_time_column BETWEEN 946758896000 AND 956758896000
GROUP BY
    y_axis, x_axis_days
ORDER BY
    x_axis_days, y_axis
```

Which would give a result like this:

```
+-------------+---------------------------------+----------------+
| x_axis_days | y_axis                          | num_of_results |
+-------------+---------------------------------+----------------+
| 2000-01-01  | column-1-value – column-1-value | 3423           |
| 2000-01-01  | column-1-value – column-2-value | 1533           |
| 2000-01-02  | column-2-value – column-1-value | 9323           |
| 2000-01-02  | column-2-value – column-2-value | 3229           |
+-------------+---------------------------------+----------------+
```
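
For reference, the unix-time bounds used in the `BETWEEN` are precomputed from calendar dates before running the query; a sketch of that conversion, again assuming PostgreSQL and millisecond timestamps (the dates here are illustrative):

```sql
-- Sketch: convert calendar dates to millisecond unix time ahead of time,
-- so the query itself only compares integers.
SELECT (extract(epoch FROM timestamp '2000-01-01') * 1000)::bigint AS from_ms,
       (extract(epoch FROM timestamp '2000-05-01') * 1000)::bigint AS to_ms;
```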

Basically these are the requirements:

  1. We can filter by any of the 9 integer columns using IN

  2. There's a `unix_time_column` column that is also an integer, but we use a `BETWEEN` clause on it

  3. We have to `GROUP BY` the `unix_time_column`, which is a unix time that we format to either daily/monthly/every-12-hours/hourly/30-mins/5-mins

  4. We have to `GROUP BY` a selection of the `IN` columns (for instance column_1 and column_2)

  5. The table will have 1,000 million rows (1 billion)

  6. Each `IN` column will have around 20-30 different values and I assume should be indexed (see the index sketch after this list)

  7. Budget is around 200 dollars a month and we plan to use AWS.
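
As a sketch of what the indexing mentioned in point 6 could look like (an assumption on our side, not something we have tested at this scale): the usual B-tree advice in MySQL/PostgreSQL is to put the equality/`IN` columns first and the range (`BETWEEN`) column last.

```sql
-- Sketch only: a composite index with the commonly filtered IN columns
-- first and the BETWEEN range column last. The right column order depends
-- on which 3-4 IN columns are actually used in practice.
CREATE INDEX idx_my_table_filter
    ON my_table (column_5, column_8, column_9, unix_time_column);
```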

Around how many rows do you think would be too many before we hit performance issues?

Thanks for reading 🙂


