Layer Workflow Prototype
The layer workflow is all about automating the process of uploading, updating, and deleting layers and associated metadata from MoL.
It does this by allowing source providers to manage their layer collections in a Git repository, which the MoL backend server pulls from automatically (or on-demand). The process is controlled source-by-source using a YAML configuration file, which defines source collections and associated layer metadata.
This page provides some early documentation about how to work with the prototype code currently in development.
You'll need Python 2.6 or later installed on the machine from which you are testing, along with the simplejson, PyYAML and GDAL packages.
Of course, you'll also need Git installed so that you can clone and use the MoL code repository with a command such as
git clone git-login@github.com:MapofLife/MOL.git
where instead of git-login you use your git login name.
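A quick way to confirm the Python prerequisites is to try importing each package. The sketch below assumes the usual import names (simplejson for simplejson, yaml for PyYAML, osgeo for the GDAL bindings). The helper name missing_modules is made up for illustration and is not part of the MoL code.

```python
# Import names for the three required packages. PyYAML is imported as
# "yaml" and the GDAL Python bindings as "osgeo".
REQUIRED_MODULES = ["simplejson", "yaml", "osgeo"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages: %s" % ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, install it with whatever package manager you normally use before running the loader.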
Right now we have a script called MOL/workflow/mol-data/loader.py. It processes all source directories and all of their collection directories, combines all collection-level fields with all DBF fields for each layer into a single CSV file for each collection, and then optionally bulkloads the resulting CSV file to App Engine or the local development server. The script is intended to be run automatically by the backend server, but for now we'll run it manually from a terminal. Here's what you should see when you run it with the --help flag:
$ ./loader.py --help
Usage: loader.py [options]
Options:
  -h, --help            show this help message and exit
  --config_file=FILE    Bulkload YAML config file.
  -d, --dry_run         Creates CSV file but does not bulkload it
  -l, --localhost       Shortcut for bulkloading to
                        http://localhost:8080/_ah/remote_api
  -s SOURCE_DIR, --source_dir=SOURCE_DIR
                        Directory containing source to load.
  --url=URL             URL endpoint to /remote_api to bulkload to.
  -V, --no-validate     Turns off validation of the config.yaml files being
                        processed.
The config_file is the bulkload.yaml file, not the config.yaml file inside a source directory. A dry run processes all of the layers for each of the collections in each of the sources in the mol-data directory into their respective CSV files, but does not upload the entities to App Engine. The URL parameter is the remote_api endpoint to bulkload to, and should be http://localhost:8080/_ah/remote_api if you are testing against a local App Engine instance. We'll see an example below showing how the parameters are used with the loader.py script.
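For readers who want to see how options like these fit together, here is a minimal sketch that declares the same flags with Python's standard optparse module. It is reconstructed from the --help output above, not copied from loader.py. The dest names, the target_url helper, and the rule that --localhost takes precedence over --url are all assumptions.

```python
from optparse import OptionParser

LOCALHOST_URL = "http://localhost:8080/_ah/remote_api"


def build_parser():
    """Declare the options listed in the --help output above."""
    parser = OptionParser(usage="%prog [options]")
    parser.add_option("--config_file", dest="config_file", metavar="FILE",
                      help="Bulkload YAML config file.")
    parser.add_option("-d", "--dry_run", dest="dry_run",
                      action="store_true", default=False,
                      help="Creates CSV file but does not bulkload it")
    parser.add_option("-l", "--localhost", dest="localhost",
                      action="store_true", default=False,
                      help="Shortcut for bulkloading to " + LOCALHOST_URL)
    parser.add_option("-s", "--source_dir", dest="source_dir",
                      help="Directory containing source to load.")
    parser.add_option("--url", dest="url",
                      help="URL endpoint to /remote_api to bulkload to.")
    parser.add_option("-V", "--no-validate", dest="no_validate",
                      action="store_true", default=False,
                      help="Turns off validation of the config.yaml files "
                           "being processed.")
    return parser


def target_url(options):
    """Pick the bulkload endpoint (assumed rule: --localhost wins over --url)."""
    if options.localhost:
        return LOCALHOST_URL
    return options.url
```

For example, build_parser().parse_args(["-d", "--config_file=bulkload.yaml"]) yields options with dry_run set to True and config_file set to "bulkload.yaml".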
In terms of directory structure for the layer data, each source (provider) must be in a distinct directory that contains all of the data for that source. A provider could have more than one source directory, if desired, but only one is necessary if all of the metadata about the source apply to all of the collections in that source directory. Each collection must have a distinct directory within its source directory. Each source directory must contain a single config.yaml file, which contains information about the source (provider) and defines each collection in that source's directory structure, including the collection directory names and the required/optional fields common to all layers in every configured collection. Each collection directory must contain all of the shapefiles for that collection. Here's an example of the directory structure:
mol-data/
    source1/
        config.yaml
        collection1/
            layer1.shp
            layer1.dbf
            ...
        collection2/
            layer1.shp
            layer1.dbf
            ...
        collection3/
            layer1.shp
            layer1.dbf
            ...
        ...
    source2/
        ...
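The layout rules above can be checked with a short directory walk. The following is a sketch under stated assumptions, not the loader's own traversal code: the function names find_collections and find_layers are hypothetical, every subdirectory of the data directory is treated as a source, and every subdirectory of a source is treated as a collection.

```python
import os


def find_collections(data_dir):
    """Map each source directory to its sorted list of collection directories.

    Raises ValueError if a source directory lacks a config.yaml file.
    """
    layout = {}
    for source in sorted(os.listdir(data_dir)):
        source_path = os.path.join(data_dir, source)
        if not os.path.isdir(source_path):
            continue  # skip loose files such as scripts or YAML files
        if not os.path.isfile(os.path.join(source_path, "config.yaml")):
            raise ValueError("%s has no config.yaml" % source)
        layout[source] = sorted(
            name for name in os.listdir(source_path)
            if os.path.isdir(os.path.join(source_path, name)))
    return layout


def find_layers(collection_dir):
    """Return the sorted shapefile names (*.shp) in a collection directory."""
    return sorted(name for name in os.listdir(collection_dir)
                  if name.lower().endswith(".shp"))
```

Calling find_collections on a directory laid out as above returns a dictionary keyed by source name, which makes it easy to spot a missing config.yaml or a misplaced collection before running the loader.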
The template config.template.yaml can be copied to create a new configuration file config.yaml within a source directory.
Running the loader.py script will process all source directories in MOL/workflow/mol-data. For the following example directory structure:
MOL/workflow/mol-data/
    jetz/
        config.yaml
        mammals/
            layer1.shp
            layer1.dbf
            ...
and App Engine running locally (see below), if we run the following from within the workflow/mol-data directory:
$ ./loader.py --url=http://localhost:8080/_ah/remote_api --config_file=bulkload.yaml
we would see the following output:
INFO:root:Processing source directories: ['jetz']
INFO:root:Collections in jetz: ['mammals']
INFO:root:Processing 1 layers in the mammals collection
INFO:root:Extracting DBF fields from layer1.shp
INFO:root:All collection metadata saved to collection.csv.txt
Uploading data records.
[INFO ] Logging to bulkloader-log-20110815.214504
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
Please enter login credentials for localhost:8080
Email:
[INFO ] Opening database: bulkloader-progress-20110815.214504.sql3
[INFO ] Connecting to localhost:8080/_ah/remote_api
[INFO ] Starting import; maximum 10 entities per post
[INFO ] 1 entities total, 0 previously transferred
[INFO ] 1 entities (11832 bytes) transferred in 1.5 seconds
[INFO ] All entities successfully transferred
Uploading data records.
[INFO ] Logging to bulkloader-log-20110815.214505
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Batch Size: 10
Please enter login credentials for localhost:8080
Email:
[INFO ] Opening database: bulkloader-progress-20110815.214505.sql3
[INFO ] Connecting to localhost:8080/_ah/remote_api
[INFO ] Starting import; maximum 10 entities per post
[INFO ] 1 entities total, 0 previously transferred
[INFO ] 1 entities (14222 bytes) transferred in 1.1 seconds
[INFO ] All entities successfully transferred
INFO:root:Loading finished!
You may be asked for login credentials to the local App Engine instance. If you haven't done anything to set those up, just hit ENTER at the Email: prompt.
loader.py will try to check the fields in your config.yaml file against the fields specification; since the specification has not yet been finalized, it may be hard to keep modifying your file to conform to the specification. In this case, use the --no-validate option to temporarily turn off field validation.
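To make the idea of field validation concrete, here is a minimal sketch of checking configured field names against a specification of required and optional fields. The real rules applied by loader.py are not documented on this page, so the functions check_fields and validate, and the no_validate argument mirroring the --no-validate flag, are illustrative assumptions.

```python
def check_fields(fields, required, optional):
    """Compare configured field names against a specification.

    Returns (missing, unknown): required names absent from `fields`, and
    names in `fields` that the specification does not mention.
    """
    present = set(fields)
    missing = sorted(set(required) - present)
    unknown = sorted(present - set(required) - set(optional))
    return missing, unknown


def validate(fields, required, optional, no_validate=False):
    """Raise ValueError on a mismatch unless validation is switched off."""
    if no_validate:
        return
    missing, unknown = check_fields(fields, required, optional)
    if missing or unknown:
        raise ValueError("missing required: %s; not in specification: %s"
                         % (missing, unknown))
```

With placeholder names, check_fields({"title": "x", "extra": 1}, ["title", "creator"], ["date"]) returns (["creator"], ["extra"]).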
After you run the loader.py script, information about each layer in each source collection will have been processed and loaded to the App Engine datastore or the local development server. If the dry run parameter was used, processing stops once the CSV files ready for App Engine bulkloading have been written. Note that the script does not yet handle updated and deleted source directories, collection directories, and layers.
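The CSV step, in which collection-level fields are combined with each layer's DBF fields, can be sketched with the standard csv module. This is an illustration only: reading the DBF files (which the loader presumably does through GDAL) is left out, plain dictionaries stand in for the DBF records, and the function names, the sorted header, and the rule that layer values override collection values are assumptions.

```python
import csv


def collection_rows(collection_fields, layers):
    """Yield one merged row per layer.

    `collection_fields` holds values shared by every layer in the collection;
    `layers` is a list of dicts of per-layer DBF values. On a name clash the
    layer value wins (an arbitrary choice made for this sketch).
    """
    for layer_fields in layers:
        row = dict(collection_fields)
        row.update(layer_fields)
        yield row


def write_collection_csv(out, collection_fields, layers):
    """Write the merged rows to the file-like object `out` as CSV."""
    rows = list(collection_rows(collection_fields, layers))
    # Use the union of all keys as the header, in sorted order, so that
    # layers with different DBF columns still fit in one file.
    header = sorted(set(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Each output row carries the shared collection values plus that layer's own values, and layers that lack a column get an empty cell.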
For testing we want to upload entities to a local App Engine data store and view them there. To do this, App Engine needs to be running locally. The App Engine libraries are all in the MoL directory tree, so you don't need to install them separately. You just need to make sure that your path contains path_to_mol/lib/google_appengine, where path_to_mol is the root of your local MoL git repository. Once that is on your path you will be able to start up the local data store using the following command:
$ dev_appserver.py -c .
The -c parameter tells the dev server to clear out the data store before starting up. After running dev_appserver.py you should get informational messages, the last of which should be something like the following:
INFO 2011-08-18 19:41:47,191 dev_appserver_multiprocess.py:637] Running application mol-lab on port 8080: http://localhost:8080
The dev server will remain running until you stop the process with CTRL-C.
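If the dev_appserver.py command is not found, the path setting is the first thing to check. The sketch below searches a PATH-style string for a file; the helper name find_on_path is hypothetical and the code is not part of the MoL repository.

```python
import os


def find_on_path(filename, path=None):
    """Return the first directory on `path` containing `filename`, else None.

    `path` is a PATH-style string; it defaults to the PATH environment
    variable of the current process.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    location = find_on_path("dev_appserver.py")
    if location is None:
        print("dev_appserver.py not found; "
              "add path_to_mol/lib/google_appengine to your path")
    else:
        print("dev_appserver.py found in %s" % location)
```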
The final part of testing is to view the data you have uploaded to the local server. You can access the datastore viewer in a browser at http://localhost:8080/_ah/admin/datastore. If the example data were uploaded successfully, you should see two Entity Kinds in the list in the data viewer interface: Layer and LayerIndex. If you click on the "List Entities" button for Layer you should see one entity in the list, and if you look at its details you should see source-level data from the config.yaml file. If you select the LayerIndex entity kind instead of the Layer entity kind, you should see a longer list of entities, and the details for these contain the information about the distinct layers in the source.