Sunday, June 22, 2008

Mapreduce for species in a Protected Area

The World Commission on Protected Areas (WCPA) publishes shapefiles for protected areas, so I thought I'd run them against the GBIF index and produce some species lists.
Since I was running against the full 135 million record GBIF index and the IUCN Categories I-VI (National) shapefiles containing 60,753 polygons, I went back to Hadoop running in single-node mode (i.e. the simplest of simple setups), as I was not sure the Mapreduce 'Lite' I wrote would cut it.

Input file: 135 million records of 12 DwC fields (13 GB)
Stack: Hadoop for Mapreduce, Geotools for the polygon operations

I went for the brute force approach: the Map held a List<Polygon> and, for each GBIF point record, I looped over all the polygons (60,000 x 135M = 8,100,000,000,000 comparisons ;o)
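
For illustration, here is a minimal sketch of that Mapper against the old 2008-era Hadoop API with JTS (via Geotools). The class name, the field positions in the DwC record and the way the polygons get into memory are assumptions rather than the exact code I ran:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.geom.Polygon;

public class SpeciesByProtectedAreaMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // all 60,753 polygons, loaded once per task (e.g. in configure())
  private List<Polygon> polygons;
  private final GeometryFactory factory = new GeometryFactory();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");
    String species = fields[0];                  // hypothetical column order
    double lng = Double.parseDouble(fields[1]);
    double lat = Double.parseDouble(fields[2]);
    Point p = factory.createPoint(new Coordinate(lng, lat));

    // the brute force part: every point is tested against every polygon
    for (int i = 0; i < polygons.size(); i++) {
      if (polygons.get(i).contains(p)) {
        // key by polygon, collect the species; a reducer distincts the values
        output.collect(new Text("polygon-" + i), new Text(species));
      }
    }
  }
}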

Of course this did not perform... but I wanted to post some benchmarks:

Reference: a Map that does nothing, reading the full GBIF index as input: 500 secs
Per polygon: 1,500 secs to produce the species list for a single polygon
(Note: this is running Mapreduce in a non-clustered, single-server environment.)

I also tried this using the polygon bounding boxes only, to remove Geotools from the equation in each Map operation, and the results were the same.
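
For that test, the per-record check collapses to plain double comparisons once the envelopes are pre-computed; roughly (the array layout here is just illustrative):

// pre-computed once per polygon from polygon.getEnvelopeInternal()
double[] minX, minY, maxX, maxY;

boolean inBoundingBox(int i, double x, double y) {
  // no Geotools/JTS call at all in the hot loop
  return x > minX[i] && x < maxX[i] && y > minY[i] && y < maxY[i];
}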

It is clear that to do this kind of processing, the data needs to be sliced up to reduce the number of combinations.
I am pondering using the same tiling algorithm that the map guys seem to have settled upon (e.g. Google Earth superoverlay style). By doing this, a preselect based on the intersection of the protected area bounding box with the tiled data would massively reduce the processing. I am thinking of 7 zoom levels, resulting in 32,768 distinct tiles and therefore working only on 2.8x2.8 degree cells. Currently GBIF models to 1x1 degree and 0.1x0.1 degree cells, but that does not port easily to Mapreduce for partitioning the input data.
(As a side effect, the tiles could be processed in parallel for specific mapping views ready for immediate serving through Geoserver...)
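
As a rough sketch of the partitioning step (the 2.8125 degree cell size and the key format are illustrative assumptions, not a settled design), a first Map pass could emit a tile key per record so the 135M points end up grouped into per-tile files:

// illustrative cell size: 360 / 128 = 2.8125 degrees
static final double CELL_SIZE = 2.8125;

// emitted as the map output key; each distinct key becomes one tile of points
static String tileKey(double lng, double lat) {
  int col = (int) Math.floor((lng + 180.0) / CELL_SIZE);   // 0..127
  int row = (int) Math.floor((lat + 90.0) / CELL_SIZE);    // 0..63
  return col + "_" + row;
}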

1 comment:

Tim Robertson said...

[08:39:52 AM] Javier de la Torre says: hey tim, gonna take a look at your message now... I have been busy the whole weekend :)
[08:40:02 AM] Tim says: no worries mate
[08:40:06 AM] Tim says: I blogged last night...
[08:40:16 AM] Javier de la Torre says: ahh, gonna take a look
[08:40:43 AM] Javier de la Torre says: lets see
[08:42:57 AM] Javier de la Torre says: uhmmm.. interesting...
[08:43:03 AM] Tim says: is it?
[08:43:10 AM] Tim says: I can never tell when you're being sarcastic...
[08:43:26 AM] Javier de la Torre says: especially I would love to see how you hack with geotools... you have to show me this live
[08:43:36 AM] Javier de la Torre says: no, really, well, you know I love these topics
[08:43:37 AM] Tim says: it's easy man
[08:44:00 AM] Tim says: i didn't check this in, but I will do tonight - is on laptop...
[08:44:46 AM] Javier de la Torre says: i would have tried a different approach, as always :), but im not sure what is best... thats a difficult question yeah
[08:44:48 AM] Tim says: remember... I am interested in HUGE scale, so before you say "use PostGIS"... I am wondering whether we can generate this kind of stuff in a massively parallel manner without the need for DBs...
[08:44:57 AM] Tim says: what is your approach?
[08:45:17 AM] Javier de la Torre says: yeah yeah... i suppose you want to include this in the harvesting process...
[08:45:26 AM] Tim says: well... just playing really...
[08:45:28 AM] Javier de la Torre says: i mean, doing it in parallel to this
[08:45:38 AM] Tim says: what would your way be?
[08:46:00 AM] Javier de la Torre says: with gdal i think you can test if a point falls in a polygon
[08:46:09 AM] Tim says: gdal?
[08:46:14 AM] Javier de la Torre says: and that might be fast... but have never tried it
[08:46:25 AM] Tim says: I am doing that with Geotools...
[08:46:33 AM] Tim says: is Gdal not just geotools in C?
[08:46:40 AM] Javier de la Torre says: uhmm somehow...
[08:47:01 AM] Javier de la Torre says: i think (but this is just my impression!) gdal is more developed than geotools
[08:47:05 AM] Tim says: I am doing:

if (polygon.contains(geometryFactory.createPoint(new Coordinate(x, y)))) { ... }
[08:47:26 AM] Javier de la Torre says: well... one thing you should improve is trying first by bbox
[08:47:33 AM] Javier de la Torre says: thats what postgis does...
[08:47:49 AM] Javier de la Torre says: it always runs against the bbox first and only then really checks the polygon
[08:47:56 AM] Tim says: and I ran it by just starting the loop over the 135M, getting the BB and then testing that with straight

x>minX && x<maxX

type stuff
[08:48:01 AM] Tim says: and it was same speed
[08:48:10 AM] Tim says: Geotools is not the bottleneck
[08:48:14 AM] Tim says: I blogged that though...
[08:48:26 AM] Tim says: what you say is exactly what I do...
[08:48:39 AM] Tim says: 60,000 x 135M = 8,100,000,000,000 comparisons

is the problem...
[08:48:50 AM] Tim says: each 135M took 500secs...
[08:48:56 AM] Javier de la Torre says: i see, and probably file access in each query is also slow
[08:48:59 AM] Tim says: but I am running in non parallel
[08:49:04 AM] Tim says: no file access..
[08:49:13 AM] Javier de la Torre says: all in memory?
[08:49:17 AM] Tim says: other than the Map loop for the 135M records
[08:49:22 AM] Tim says: the polygon is in memory
[08:50:00 AM] Tim says: 60,000 x 135M = 8,100,000,000,000 comparisons

if I can reduce the 135M to something that is only say 1M then it is much quicker...
[08:50:16 AM] Tim says: takes 11 secs for 1M records and that is in single node...
[08:50:20 AM] Javier de la Torre says: well density tables
[08:50:31 AM] Javier de la Torre says: well, in an y case this is fast
[08:50:41 AM] Tim says: hmmm... density loses the point value
[08:50:44 AM] Javier de la Torre says: 11 secs to analyze 1M records looks fast to me
[08:50:51 AM] Tim says: yes...
[08:51:02 AM] Tim says: and that's single-server mode... parallelised, that would drop
[08:51:11 AM] Tim says: so 4 machines would do it in 3 secs...

[08:51:24 AM] Javier de la Torre says: yeah, but you have to consider the quality of the polygons... if the polygons are not very detailed then it does not help having huge point precision
[08:51:26 AM] Tim says: "It is clear that to do this kind of processing, the data needs to be sliced up to reduce the number of combinations.
I am pondering using the same tiling algorithm that the map guys seem to have settled upon (e.g. Google Earth superoverlay style). By doing this, a preselect based on the intersection of the protected area bounding box with the tiled data would massively reduce the processing. I am thinking of 7 zoom levels, resulting in 32,768 distinct tiles and therefore working only on 2.8x2.8 degree cells. Currently GBIF models to 1x1 degree and 0.1x0.1 degree cells, but that does not port easily to Mapreduce for partitioning the input data."
[08:51:36 AM] Tim says: polygons are REALLY detailed
[08:51:51 AM] Tim says: many of them cover only a couple of kms...
[08:52:14 AM] Javier de la Torre says: i see
[08:52:25 AM] Tim says: so you do need to do point checking...
[08:52:38 AM] Tim says: but if you can slice up the points into 2.8x2.8 files...
[08:52:43 AM] Javier de la Torre says: well... but then you have it... i mean... 4 machines doing it in 3 secs?
[08:52:50 AM] Javier de la Torre says: thats very very good
[08:53:00 AM] Tim says: 4 machines doing it in 3 secs?

is only 1 million input points
[08:53:02 AM] Javier de la Torre says: and you only need to perform this one time...
[08:53:06 AM] Tim says: yeppa
[08:53:12 AM] Tim says: I am sure this is the way to go...
[08:53:29 AM] Tim says: 2.8 degrees = 300 km or so...
[08:53:40 AM] Javier de la Torre says: yeah.... i wouldnt try to reduce it further...
[08:53:45 AM] Tim says: yeah...
[08:53:47 AM] Javier de la Torre says: i mean.. if the times you have scale
[08:53:57 AM] Tim says: I am wondering about buying 4 minimacs...
[08:54:10 AM] Javier de la Torre says: so 1M 4 machines = 3sec
[08:54:22 AM] Tim says: yes
[08:54:22 AM] Javier de la Torre says: 100M 4 machines = 30sec?
[08:54:26 AM] Tim says: no..
[08:54:28 AM] Tim says: well yes
[08:54:32 AM] Tim says: but:
[08:54:40 AM] Tim says: I wouldn't be feeding in 100M points
[08:54:54 AM] Tim says: as it would be sliced up before
[08:55:05 AM] Tim says: and then there would only be 1M or so in each 2.8 cell
[08:55:08 AM] Tim says: sometimes more...
[08:55:28 AM] Javier de la Torre says: i see...
[08:55:32 AM] Tim says: you do?
[08:55:45 AM] Javier de la Torre says: thats the scaling idea you have right?
[08:55:55 AM] Tim says: uhm...
[08:55:56 AM] Javier de la Torre says: you mean like zoom levels
[08:56:04 AM] Tim says: and parallel processing
[08:56:17 AM] Tim says: parallel processing to split the data up
[08:56:20 AM] Tim says: into tiles
[08:56:29 AM] Tim says: parallel processing to process the tiles into views
[08:56:44 AM] Javier de la Torre says: why? is it better to have 1M than 100M (apart from taking more time)
[08:56:55 AM] Tim says: it will need to be done multiple times:

species by polygonX
species by polygonY
species by cell
cells by species
etc
[08:57:01 AM] Tim says: yeah...
[08:57:07 AM] Tim says: because:
[08:57:15 AM] Tim says: you are comparing against 60,000 polygons...
[08:57:23 AM] Tim says: better to split ONCE
[08:57:33 AM] Tim says: than do 60,000 loops each on all points
[08:57:53 AM] Tim says: 60,000 x 135M = 8,100,000,000,000 comparisons
[08:58:07 AM] Tim says: sure it would work but take a long time
[08:58:10 AM] Javier de la Torre says: but...
[08:58:34 AM] Javier de la Torre says: 60,000 x 1M x 135 times, isnt it the same result?
[08:58:39 AM] Javier de la Torre says: there is something i dont get
[08:58:41 AM] Tim says: no
[08:58:47 AM] Tim says: because...
[08:58:49 AM] Tim says: you do this:
[08:58:59 AM] Tim says: 1) Split the data into cells (135M records analysed)
[08:59:09 AM] Tim says: 2) start a loop of a polygon
[08:59:14 AM] Tim says: 3) get the BB of polygon
[08:59:22 AM] Tim says: 4) find the tiles that MIGHT hold data
[08:59:32 AM] Tim says: 5) use those tiles as input - say 1M records on average
[08:59:37 AM] Tim says: 6) go back to 2)
[11:00:12 AM] Tim says: so you do (135M x 1) and then (simple pre-select + (60,000 x 1M))
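
(A rough sketch of step 4, assuming the same illustrative 2.8125 degree grid as in the post above: the polygon's bounding box is turned into the set of candidate tile keys, and only those tile files are fed to the job.)

import java.util.HashSet;
import java.util.Set;
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.geom.Polygon;

static final double CELL_SIZE = 2.8125;   // same illustrative grid as above

// step 4: which tiles MIGHT hold data for this polygon?
static Set<String> candidateTiles(Polygon polygon) {
  Envelope bb = polygon.getEnvelopeInternal();
  Set<String> tiles = new HashSet<String>();
  for (int col = (int) Math.floor((bb.getMinX() + 180.0) / CELL_SIZE);
       col <= (int) Math.floor((bb.getMaxX() + 180.0) / CELL_SIZE); col++) {
    for (int row = (int) Math.floor((bb.getMinY() + 90.0) / CELL_SIZE);
         row <= (int) Math.floor((bb.getMaxY() + 90.0) / CELL_SIZE); row++) {
      tiles.add(col + "_" + row);   // only these tile files become job input
    }
  }
  return tiles;
}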
[11:01:16 AM] Javier de la Torre says: ahh so you want to go faster by creating the density tables to use them on the bbox query...
[11:01:18 AM] Tim says: This is more in line with what PostGIS would do with an index on the polygon and the point
[11:01:30 AM] Tim says: well - it is not a density table...
[11:01:34 AM] Javier de la Torre says: yeah yeah
[11:01:37 AM] Tim says: no no
[11:01:38 AM] Tim says: it isn't
[11:01:46 AM] Tim says: density = count of records by some area
[11:01:50 AM] Tim says: this does not do that
[11:01:56 AM] Tim says: it is a subset of records within an area
[11:02:24 AM] Javier de la Torre says: so, thats like a density table for me... in the sense of grouping the records...
[11:02:56 AM] Javier de la Torre says: but yeah... the idea is to group the records into cells... ok dont call it density table...
[11:03:01 AM] Tim says: right... ok
[11:03:07 AM] Tim says: is not a density table though ;)
[11:03:12 AM] Javier de la Torre says: ok ok
[11:03:13 AM] Tim says: ok...
[11:03:18 AM] Tim says: I think we agree then?
[11:03:51 AM] Tim says: Is my email request for that chart possible?
[11:03:52 AM] Javier de la Torre says: well, still not sure if this would be the fastest... depends on the size of the polygons for sure...
[11:03:54 AM] Tim says: ;)
[11:04:02 AM] Javier de la Torre says: with hight details you have a winner
[11:04:13 AM] Tim says: sorry?
[11:04:15 AM] Tim says: hight?
[11:04:18 AM] Tim says: height?
[11:04:25 AM] Javier de la Torre says: high detail
[11:04:28 AM] Tim says: ahhh
[11:04:29 AM] Javier de la Torre says: on the polygons...
[11:04:30 AM] Tim says: yeah
[11:04:45 AM] Tim says: i will try it...
[11:04:52 AM] Javier de la Torre says: sounds good to me...
[11:04:55 AM] Tim says: it has to be a somewhat generic solution...
[11:04:55 AM] Javier de la Torre says: and your idea...
[11:05:13 AM] Tim says: hey - should I copy this chat to the comments on the blog and then people can laugh at our conversation...
[11:05:13 AM] Javier de la Torre says: is that I can give you a polygon source for you to do the thing for me?
[11:05:18 AM] Tim says: yeppa!
[11:05:42 AM] Javier de la Torre says: cool... thats interesting!
[11:05:54 AM] Tim says: hey - should I copy this chat to the comments on the blog and then people can laugh at our conversation?
[11:06:11 AM] Javier de la Torre says: yeah, go for it...