We have a pipeline that reads from S3 and does some processing with a Python script (circled in the flow diagram): it splits and merges a large batch of JSON files, groups similar entries into different files, and copies them to a different S3 bucket. An SQS queue listens to the source S3 bucket; every minute the script reads filenames from the queue and copies those files from S3 for processing.
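As a sketch of the consumer step described above: when S3 event notifications are delivered through SQS, each message body is JSON containing the bucket name and object key, so a worker can recover the filenames with something like this (the helper name `extract_s3_objects` is ours, and the event body below is trimmed to the fields we use):

```python
import json

def extract_s3_objects(message_body: str) -> list[tuple[str, str]]:
    """Parse an S3 event notification (as delivered via SQS) and
    return (bucket, key) pairs for the objects it refers to."""
    event = json.loads(message_body)
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

# Example body in the shape S3 publishes (fields trimmed for brevity):
body = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "source-bucket"},
                "object": {"key": "incoming/batch-001.json"}}}
    ]
})
print(extract_s3_objects(body))  # [('source-bucket', 'incoming/batch-001.json')]
```

The worker would then call `s3.download_file(bucket, key, local_path)` (boto3) for each pair before running the split/merge step.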

All of this split-merge processing happens on a single host. We are expecting a significant surge in traffic and want to scale the process up.

One option we are looking at is to run this process from multiple hosts using an internal scheduler tool, relying on the SQS visibility timeout to make received messages invisible so that each one is processed only once. We have yet to experiment with this, but we would welcome suggestions for other ways to do it.
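To illustrate why the multi-host approach is safe, here is a toy in-memory model of the SQS visibility-timeout semantics (this is not SQS itself, just an invented `ToyQueue` class): once one host receives a message, other hosts polling the same queue do not see it until the timeout expires or the message is deleted.

```python
import time

class ToyQueue:
    """In-memory stand-in for SQS semantics: receive() hides a message
    for `visibility_timeout` seconds; delete() removes it for good."""
    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # msg_id -> body
        self.invisible_until = {}   # msg_id -> monotonic timestamp

    def send(self, msg_id: str, body: str):
        self.messages[msg_id] = body

    def receive(self):
        now = time.monotonic()
        for msg_id, body in self.messages.items():
            if self.invisible_until.get(msg_id, 0) <= now:
                # Hide the message from other consumers for the timeout.
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None  # nothing visible right now

    def delete(self, msg_id: str):
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)

q = ToyQueue(visibility_timeout=30)
q.send("m1", "incoming/batch-001.json")

host_a = q.receive()   # host A gets the message...
host_b = q.receive()   # ...host B sees nothing while it is invisible
print(host_a, host_b)  # ('m1', 'incoming/batch-001.json') None
q.delete("m1")         # host A deletes it after a successful run
```

In real SQS the equivalent knobs are the queue's `VisibilityTimeout` and an explicit `delete_message` after processing; the main thing to get right is choosing a timeout longer than the worst-case split/merge time, otherwise a second host will re-process the same files when the message reappears.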

Scale up data processing from single node to multi node : bigdata


