10: Managing Environments and Scaling - Conda, Dask, rioxarray#

UW Geospatial Data Analysis
CEE467/CEWA567
David Shean

We made it to Week 10! No formal assignment this week, so please use the time to work on your final projects.

Overview#

During class, we will discuss conda/mamba and the simple steps needed to migrate from the course JupyterHub to your local computer.

We will also explore options for scaling processing to larger datasets and more complex workflows, using Dask for distributed computing, and look at some examples from the Pangeo project.

This is a bit of a catch-all week, and we will focus on the material that is most relevant to student needs and interests.

Environment setup#

conda#

  • Conda overview

  • conda vs. pip

  • mamba

  • conda-forge channel

  • Python site-packages
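To make the site-packages idea concrete, here is a minimal sketch that uses only the Python standard library to report where the active interpreter looks for installed packages. It assumes you run it inside the environment you want to inspect; the printed paths will differ between the course hub and your own machine.

```python
import site
import sys
import sysconfig

# Root directory of the active environment (for a conda env, the env folder)
print("Environment prefix:", sys.prefix)

# Directory where third-party pure-Python packages for this interpreter are installed
purelib = sysconfig.get_paths()["purelib"]
print("site-packages:", purelib)

# Per-user site-packages directory; when enabled, packages here can take
# precedence over those in the environment, which is a common source of confusion
user_site = site.getusersitepackages()
print("User site-packages:", user_site)
print("User site enabled:", site.ENABLE_USER_SITE)
```

If a package imports from an unexpected location, comparing its `__file__` attribute against these paths is usually a quick way to see which install is being picked up.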

Migrating from course JupyterHub#

Distributed processing#

GNU parallel#

Python multiprocessing#
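As a minimal sketch of the standard-library approach, the example below maps a function over a list of independent tasks with `multiprocessing.Pool`. The function `tile_stat` and the idea of "tile IDs" are hypothetical stand-ins for whatever expensive per-file or per-tile computation you have; the worker count of 4 is also an arbitrary choice.

```python
import math
from multiprocessing import Pool


def tile_stat(tile_id):
    """Hypothetical stand-in for an expensive, independent per-tile computation."""
    return sum(math.sqrt(i) for i in range(tile_id * 1000))


if __name__ == "__main__":
    tile_ids = list(range(1, 9))
    # Distribute the independent tasks across 4 worker processes
    with Pool(processes=4) as pool:
        results = pool.map(tile_stat, tile_ids)
    for tile_id, value in zip(tile_ids, results):
        print(tile_id, round(value, 2))
```

The `if __name__ == "__main__":` guard matters when worker processes are started by re-importing the script, and the worker function must be defined at module level so it can be pickled. This pattern works best when tasks do not need to share state.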

Dask#
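The core Dask idea is lazy, chunked computation: operations build a task graph, and nothing runs until you ask for a result. The sketch below illustrates this with `dask.array`, assuming `dask` and `numpy` are installed; the array size and chunk shape are arbitrary values chosen for illustration.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into sixteen 1000 x 1000 chunks
x = da.ones((4000, 4000), chunks=(1000, 1000))
print(x.numblocks)

# This only builds a task graph; no chunk is computed yet
y = (x + x.T).mean()

# compute() executes the graph (by default with a local threaded scheduler)
result = y.compute()
print(result)
```

Chunk size is the main tuning knob: chunks that are too small add scheduling overhead, while chunks that are too large may not fit in memory.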

pandas#

dask-geopandas#

xarray Dask integration#
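xarray objects can be backed by Dask arrays, so the familiar labeled operations become lazy and chunk-aware. Here is a minimal sketch, assuming `xarray`, `dask`, and `numpy` are installed; the synthetic `(time, y, x)` stack and the chunk size of 6 time steps are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic (time, y, x) stack; values are arbitrary and for illustration only
data = np.arange(24 * 100 * 100, dtype="float64").reshape(24, 100, 100)
stack = xr.DataArray(data, dims=("time", "y", "x"), name="synthetic")

# Convert to a Dask-backed array with 6 time steps per chunk
stack_chunked = stack.chunk({"time": 6})
print(stack_chunked.chunks)

# Lazy: builds the task graph for a per-pixel temporal mean
time_mean = stack_chunked.mean(dim="time")

# Trigger the computation and return an in-memory DataArray
result = time_mean.compute()
print(result.shape)
```

With real data you would more often pass a `chunks` argument when opening files, so that the data are never loaded into memory all at once.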

rioxarray#

Pangeo (http://pangeo.io/)#

“A community platform for Big Data geoscience”

https://gallery.pangeo.io/

Take a look at the rendered Pangeo notebooks on GitHub (you can also clone the repo to our JupyterHub or your local machine, or access the notebooks through Binder or the Pangeo AWS hub). I recommend working through the xarray and dask notebooks in the top-level directory and the amazon-web-services/landsat8.ipynb notebook.

If you are using the Pangeo hub, please go to File->Shut Down when you are done running notebooks. This frees up the node allocated to you and stops the clock on the cloud charges. If you forget, it’s OK: your server should time out and shut down automatically, but it is best to be a good citizen and avoid unnecessary resource consumption.

Glaciology examples#

Landsat-8 time series#

Other discussion topics#

Collaboration with Git/GitHub#

  • forking, branches, pull requests, merging

Licenses#

Data and code archiving#

  • Zenodo and other options for publishing repos