I have an algorithm that requires computing the fastest route between two (latitude, longitude) points.

It works in Python using the GraphHopper API hosted on localhost. Now I have to apply this algorithm to trillions of points, so I'm using Apache Spark (which may or may not be the best tool for this; please comment on that).

I imported the GraphHopper Java library into my Scala project. The first thing I tried was to create the GraphHopper object on the master node and initialize it there (by reading an osm.pbf file of around 2 GB; it uses 340 MB of heap space after initialization), then wrap it in a broadcast variable so I could use it to route points on each node. That did not work: with the Kryo serializer I got a concurrent access exception.

So I then tried shipping the file to each node before the job and initializing the object on each node. That works, but the only way I can think of doing it is with .map(), initializing the object inside the map. That means the object is initialized for each element, which is not what I want, since it would take an eternity (initialization takes 1-2 minutes on an m4.xlarge EC2 instance). I'd like to initialize the object once per node (which is basically what a broadcast variable gives you, except broadcast variables don't work here; if someone can suggest how to broadcast this, that would be the best solution).
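One way to get once-per-node initialization without broadcasting the GraphHopper object is a `lazy val` inside a singleton `object`: Scala initializes it at most once per JVM (so once per executor), on first access, and `mapPartitions` keeps the access cost per partition rather than per element. Below is a sketch under stated assumptions: the osm.pbf file has already been shipped to a known local path on every worker, and the GraphHopper setup/routing calls (`setOSMFile`, `importOrLoad`, `route`, `getBest`) match the version of the library you have; verify those names against your GraphHopper release, since its API has changed across versions. The S3 paths and CSV layout are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import com.graphhopper.{GHRequest, GraphHopper}

// One GraphHopper instance per executor JVM: a `lazy val` in an `object`
// is initialized at most once per JVM, thread-safely, on first access.
object Router {
  // Assumption: the .osm.pbf file was copied to every worker beforehand.
  private val OsmPath = "/data/map.osm.pbf"

  lazy val hopper: GraphHopper = {
    val gh = new GraphHopper()
    gh.setOSMFile(OsmPath)
    gh.setGraphHopperLocation("/data/gh-graph") // where the parsed graph is cached
    gh.importOrLoad()                           // the 1-2 minute step: once per node
    gh
  }
}

object RoutingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("routing").getOrCreate()
    val points = spark.sparkContext
      .textFile("s3://bucket/points.csv") // lines: fromLat,fromLon,toLat,toLon
      .map(_.split(',').map(_.toDouble))

    // mapPartitions: the lazy singleton is touched once per partition,
    // and only the first partition on each executor pays the init cost.
    val times = points.mapPartitions { iter =>
      val gh = Router.hopper
      iter.map { case Array(fLat, fLon, tLat, tLon) =>
        val rsp = gh.route(new GHRequest(fLat, fLon, tLat, tLon))
        if (rsp.hasErrors) -1L else rsp.getBest.getTime // travel time in millis
      }
    }
    times.saveAsTextFile("s3://bucket/route-times")
    spark.stop()
  }
}
```

The key design point is that the broadcast never happens: nothing about the GraphHopper object crosses the wire, only the file does, so the serialization problem disappears entirely.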

My other solution is to deploy a load-balanced system of GraphHopper APIs and have my Spark job make HTTP requests to it, but I've never tried anything like that, so if anyone has experience making HTTP requests within Spark jobs, let me know.
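Making HTTP requests from inside a Spark job is workable as long as any per-client setup happens once per partition (via `mapPartitions`) rather than once per element. A minimal sketch using only the JDK's `java.net`; the load-balancer hostname, the `/route?point=...&point=...` query shape, and the input paths are assumptions standing in for whatever your deployment actually exposes:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source
import org.apache.spark.sql.SparkSession

object HttpRoutingJob {
  // Routes one partition's points against a hypothetical load-balanced
  // routing endpoint; any expensive client setup belongs at the top, per partition.
  def routePartition(iter: Iterator[(Double, Double, Double, Double)]): Iterator[Option[String]] = {
    val base = "http://routing-lb.internal:8989/route" // hypothetical load balancer
    iter.map { case (fLat, fLon, tLat, tLon) =>
      val url = new URL(s"$base?point=$fLat,$fLon&point=$tLat,$tLon")
      val conn = url.openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(5000)  // don't let one dead backend stall the task
      conn.setReadTimeout(30000)
      try {
        Some(Source.fromInputStream(conn.getInputStream).mkString) // raw response body
      } catch {
        case _: java.io.IOException => None // record the miss, keep the job alive
      } finally conn.disconnect()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("http-routing").getOrCreate()
    val points = spark.sparkContext
      .textFile("s3://bucket/points.csv") // lines: fromLat,fromLon,toLat,toLon
      .map { line => val a = line.split(',').map(_.toDouble); (a(0), a(1), a(2), a(3)) }

    points.mapPartitions(routePartition).saveAsTextFile("s3://bucket/routes-json")
    spark.stop()
  }
}
```

One practical note: each task keeps one request in flight, so the number of concurrently running tasks bounds the load on the balancer; repartition the input to match what the API tier can absorb.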

Otherwise, the only other way I can think of is to rewrite the library myself to make it broadcastable, but before I go there, I'd like some input on better solutions (e.g., another library that can compute fastest routes).

EDIT: entry point of the library:



