We have a data pipeline that reads data from S3 and processes it with a Python script (circled in the flow diagram). The script splits and merges a large batch of JSON files, groups similar entries into separate files, and copies the results to a different S3 bucket. An SQS queue listens to the source S3 bucket; every 10 minutes the script reads the filenames from the queue and copies the corresponding objects from S3 for processing.
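For reference, the "get filenames from SQS" step might look roughly like the sketch below, which parses the standard S3 event notification JSON out of a message body (the function name `extract_keys` and the sample bucket/key are ours, not from the actual pipeline; note that S3 URL-encodes object keys in event records):

```python
import json
from urllib.parse import unquote_plus

def extract_keys(message_body: str) -> list[tuple[str, str]]:
    """Pull (bucket, key) pairs out of one S3 event notification.

    S3 URL-encodes object keys in event records, so decode them
    before calling s3.download_file / copy_object.
    """
    event = json.loads(message_body)
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs

# Example body shaped like an S3 ObjectCreated notification:
body = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "source-bucket"},
                "object": {"key": "incoming/file+1.json"}}}
    ]
})
print(extract_keys(body))  # [('source-bucket', 'incoming/file 1.json')]
```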
All of this split-merge processing happens on a single host. We are expecting a significant surge in traffic and want to scale this process out.
One option we are considering is to run this process from multiple hosts using an internal scheduler tool, relying on the SQS visibility timeout to hide in-flight messages so that each message is processed only once. We have yet to experiment with this, but we would welcome suggestions on whether it can be done a different way.
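For what it's worth, SQS already supports this competing-consumers pattern out of the box: a message received by one worker becomes invisible to the others for the visibility timeout, and deleting it after processing prevents redelivery. A minimal sketch of one poll cycle, safe to run on many hosts in parallel, might look like this (the queue URL is a placeholder and `process_object` stands in for the split/merge step; boto3 is assumed to be available on the workers):

```python
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/source-events"  # placeholder

# Parameters each worker uses when polling. A message received by one
# worker stays invisible to the others for VisibilityTimeout seconds,
# so it is processed once as long as it is deleted within that window.
RECEIVE_PARAMS = {
    "MaxNumberOfMessages": 10,   # batch up to 10 messages per poll
    "WaitTimeSeconds": 20,       # long polling to cut empty responses
    "VisibilityTimeout": 900,    # must exceed worst-case processing time
}

def poll_once(sqs, process_object):
    """One receive/process/delete cycle against the shared queue."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, **RECEIVE_PARAMS)
    for msg in resp.get("Messages", []):
        process_object(msg["Body"])          # split/merge/group step
        # Delete only after successful processing; on failure the
        # message reappears once the visibility timeout expires.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    import boto3  # assumed installed on the worker hosts
    poll_once(boto3.client("sqs"), print)
```

One caveat with this approach: if processing ever exceeds the visibility timeout, a second host can receive the same message, so the timeout needs headroom over the slowest run (or the worker can extend it with `change_message_visibility`).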