Demo: NumPy, Pandas#

UW Geospatial Data Analysis
CEE467/CEWA567
David Shean

Introduction#

This is a quick demo of some key functionality for these core Python packages, emphasizing topics that will help with lab exercises this week and later in the quarter. It is by no means complete!

Please consult the reading assignment and lists of other excellent, more complete online resources.

NumPy#

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Pandas#

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

Matplotlib#

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an object oriented interface or via a set of functions familiar to MATLAB users.

Import necessary modules#

Use shorthand aliases, so you don’t have to type out the full module name each time
Note the different structure for the matplotlib package: we import the pyplot submodule rather than the top-level package

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

NumPy 1D array#

#Create 1D array of 10 random integers
#Arguments are low (inclusive), high (exclusive), size
a = np.random.randint(0,10,10)
a

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1])

type(a)

numpy.ndarray

#np.ndarray?

Constructing an array#

#np.array?

np.array(0, 1, 2)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_512/449188611.py in <module>
----> 1 np.array(0, 1, 2)

TypeError: array() takes from 1 to 2 positional arguments but 3 were given

#Pass in an array-like object - need brackets around the numbers
np.array([0, 1, 2])

array([0, 1, 2])

mylist = [0, 1, 2]
np.array(mylist)

array([0, 1, 2])
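np.array also accepts nested lists and an optional dtype argument. A minimal sketch (the names grid and small are illustrative and do not appear elsewhere in this demo):

```python
import numpy as np

# A nested list (a list of equal-length lists) becomes a 2D array
grid = np.array([[0, 1, 2], [3, 4, 5]])
print(grid.shape)   # (2, 3)

# The dtype can be set explicitly at construction time
small = np.array([0, 1, 2], dtype='uint8')
print(small.dtype)  # uint8
```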

Array properties and datatypes#

a

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1])

a.shape

(10,)

a.size

10

a.dtype

dtype('int64')

What is ‘int64’?#

Signed integer represented by 64 bits
Each bit can be 0 or 1
0 = 0000000000000000000000000000000000000000000000000000000000000000
1 = 0000000000000000000000000000000000000000000000000000000000000001
2 = 0000000000000000000000000000000000000000000000000000000000000010
…
https://numpy.org/doc/stable/user/basics.types.html
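Rather than working out the limits by hand, you can ask NumPy for them with np.iinfo. A short sketch that prints the bit width and valid range for a few integer types (the choice of types is just for illustration):

```python
import numpy as np

# np.iinfo reports the machine limits for an integer type
for int_type in (np.int8, np.uint8, np.int64):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.bits, info.min, info.max)

# A signed type spends one bit pattern range on negative values,
# so the maximum is one less than the magnitude of the minimum
print(np.iinfo(np.int64).max == 2**63 - 1)   # True
```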

#Possible unique combinations of 64 bits
#(avoid naming this variable `range`, which would shadow the Python built-in)
n_values = 2**64
n_values

18446744073709551616

print(f"{n_values:.2e}")

1.84e+19

#Half of the bit patterns are used for negative values, the other half for zero and positive values
mm = 2**64 // 2
mm

9223372036854775808

f'A 64-bit signed integer can store values between -{mm} and +{mm - 1}'

'A 64-bit signed integer can store values between -9223372036854775808 and +9223372036854775807'

# Overkill for our single-digit integer values
a

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1])

#Number of bytes (8 bits each) for each element in the array
a.itemsize

8

#Total number of bytes for 10 elements
a.nbytes

80

# Recast to 8-bit unsigned integer (valid range: 0-255)
b = a.astype('uint8')
b

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1], dtype=uint8)

b.dtype

dtype('uint8')

2**8

256

b

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1], dtype=uint8)

b.nbytes

10

#Assign value within valid range
b[0] = 255
b

array([255,   9,   4,   6,   5,   4,   4,   3,   9,   1], dtype=uint8)

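#Assign value outside of valid range - overflow!
#The value wraps around: 257 modulo 256 = 1
#(NumPy 2.0 and later raise an OverflowError for this assignment instead of wrapping)
# https://en.wikipedia.org/wiki/Integer_overflow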
b[0] = 257
b

array([1, 9, 4, 6, 5, 4, 4, 3, 9, 1], dtype=uint8)
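The wrap-around above can also happen silently when casting a whole array with astype. The sketch below uses made-up values to show the wrap, a range check with np.iinfo before casting, and np.clip as one alternative that saturates at the limits instead of wrapping:

```python
import numpy as np

values = np.array([0, 255, 256, 257, -1])

# astype performs an unchecked cast: out-of-range values wrap modulo 256
wrapped = values.astype('uint8')
print(wrapped)    # [  0 255   0   1 255]

# Check which values fit in the target type before casting
info = np.iinfo(np.uint8)
fits = (values >= info.min) & (values <= info.max)
print(fits)       # [ True  True False False False]

# Saturate at the limits instead of wrapping
clipped = np.clip(values, info.min, info.max).astype('uint8')
print(clipped)    # [  0 255 255 255   0]
```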

2D arrays#

a2 = np.random.random((10,10))

a2

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

a2.shape

(10, 10)

a2.size

100

a2.dtype

dtype('float64')

#Get first element along first axis
#Question: is this the first row or the first column?
a2[0]

array([0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
       0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ])

#Get first element along second axis
a2[:,0]

array([0.52652326, 0.6461288 , 0.32809269, 0.54214022, 0.66986602,
       0.85876637, 0.67171947, 0.47789035, 0.02877749, 0.96086507])

#Get first element along both axes
a2[0,0]

0.5265232608453788

#Get slice along first axis - first 3 rows
a2[0:3]

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938]])

#Slice along second axis - first 3 cols
a2[:,0:3]

array([[0.52652326, 0.14231249, 0.07566598],
       [0.6461288 , 0.15507331, 0.90511239],
       [0.32809269, 0.08102169, 0.80168028],
       [0.54214022, 0.12926722, 0.67341643],
       [0.66986602, 0.12030778, 0.43275939],
       [0.85876637, 0.89397025, 0.96531165],
       [0.67171947, 0.35401307, 0.74514409],
       [0.47789035, 0.74734751, 0.26285763],
       [0.02877749, 0.83078924, 0.31289166],
       [0.96086507, 0.83894286, 0.84632944]])

a2[0:3,0:3]

array([[0.52652326, 0.14231249, 0.07566598],
       [0.6461288 , 0.15507331, 0.90511239],
       [0.32809269, 0.08102169, 0.80168028]])

ufunc#

Efficiently perform operations element-by-element in “vectorized” fashion (unrelated to a GIS vector dataset)
Do not loop over arrays (unless absolutely necessary)
https://numpy.org/doc/stable/reference/ufuncs.html

a2 * 10

array([[5.26523261, 1.42312486, 0.75665984, 5.97908434, 7.01544181,
        5.20640363, 0.08379791, 1.19271753, 1.39573862, 2.459467  ],
       [6.46128797, 1.55073312, 9.05112389, 8.07949708, 5.36184615,
        5.58583304, 3.75239908, 1.37705329, 2.27082693, 7.99296637],
       [3.28092688, 0.8102169 , 8.01680285, 2.39700296, 3.39230084,
        6.03944497, 0.65812771, 4.38819125, 1.3918295 , 8.19919379],
       [5.42140219, 1.2926722 , 6.73416434, 9.74174681, 0.34986796,
        8.76104416, 2.47629829, 0.228108  , 5.68313498, 5.14841969],
       [6.69866023, 1.2030778 , 4.3275939 , 0.48770196, 0.86807469,
        7.77527967, 4.26105014, 5.89184939, 2.47295249, 5.90849707],
       [8.58766372, 8.93970246, 9.65311645, 5.99770527, 0.08396962,
        9.31018559, 8.77151667, 7.96253349, 8.38504906, 1.00529706],
       [6.7171947 , 3.54013069, 7.45144089, 1.65028921, 3.09216152,
        8.13043078, 0.74765   , 0.86852048, 3.1914575 , 2.3735222 ],
       [4.77890351, 7.47347509, 2.62857634, 5.94847221, 7.15586397,
        5.94746812, 4.71509138, 0.34527179, 7.23322755, 4.70287758],
       [0.28777487, 8.30789238, 3.1289166 , 3.04020265, 1.24229588,
        5.741902  , 4.5545665 , 8.21227395, 7.61816135, 1.08463997],
       [9.60865071, 8.38942857, 8.46329441, 2.69032942, 1.45474407,
        7.58891143, 2.86767713, 3.6896895 , 3.78129487, 7.80415291]])

# Don't do this! (Equivalent to a2 * 10 above, but loops over rows in Python)
#for n, i in enumerate(a2):
#    a2[n] = i * 10
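To make the point concrete, here is a minimal sketch comparing the vectorized expression with an explicit element-by-element loop. It uses its own small random array (the seed is arbitrary) so it does not modify a2. Both approaches give the same values, but the loop runs in the Python interpreter and becomes much slower as arrays grow:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator; the seed value is arbitrary
x = rng.random((10, 10))

# Vectorized: one expression, the element loop runs in compiled code
fast = x * 10

# Explicit Python loop over every row and column: same values, more code, slower
slow = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        slow[i, j] = x[i, j] * 10

print(np.allclose(fast, slow))   # True
```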

#np.power(a2, 2)
a2**2

array([[2.77226744e-01, 2.02528438e-02, 5.72534107e-03, 3.57494495e-01,
        4.92164238e-01, 2.71066388e-01, 7.02209035e-05, 1.42257512e-02,
        1.94808630e-02, 6.04897792e-02],
       [4.17482422e-01, 2.40477320e-02, 8.19228437e-01, 6.52782731e-01,
        2.87493942e-01, 3.12015307e-01, 1.40804989e-01, 1.89627575e-02,
        5.15665493e-02, 6.38875114e-01],
       [1.07644812e-01, 6.56451420e-03, 6.42691279e-01, 5.74562317e-02,
        1.15077050e-01, 3.64748955e-01, 4.33132086e-03, 1.92562224e-01,
        1.93718936e-02, 6.72267787e-01],
       [2.93916017e-01, 1.67100142e-02, 4.53489693e-01, 9.49016309e-01,
        1.22407590e-03, 7.67558948e-01, 6.13205320e-02, 5.20332575e-04,
        3.22980232e-01, 2.65062253e-01],
       [4.48720489e-01, 1.44739619e-02, 1.87280690e-01, 2.37853201e-03,
        7.53553662e-03, 6.04549740e-01, 1.81565483e-01, 3.47138893e-01,
        6.11549402e-02, 3.49103376e-01],
       [7.37479681e-01, 7.99182800e-01, 9.31826572e-01, 3.59724685e-01,
        7.05089649e-05, 8.66795558e-01, 7.69395047e-01, 6.34019395e-01,
        7.03090478e-01, 1.01062218e-02],
       [4.51207047e-01, 1.25325253e-01, 5.55239714e-01, 2.72345449e-02,
        9.56146285e-02, 6.61039047e-01, 5.58980528e-03, 7.54327828e-03,
        1.01854010e-01, 5.63360766e-02],
       [2.28379187e-01, 5.58528300e-01, 6.90941356e-02, 3.53843216e-01,
        5.12063892e-01, 3.53723770e-01, 2.22320868e-01, 1.19212608e-03,
        5.23195808e-01, 2.21170575e-01],
       [8.28143736e-04, 6.90210757e-01, 9.79011906e-02, 9.24283216e-02,
        1.54329906e-02, 3.29694386e-01, 2.07440760e-01, 6.74414434e-01,
        5.80363824e-01, 1.17644387e-02],
       [9.23261684e-01, 7.03825118e-01, 7.16273522e-01, 7.23787240e-02,
        2.11628030e-02, 5.75915767e-01, 8.22357214e-02, 1.36138086e-01,
        1.42981909e-01, 6.09048026e-01]])

#a2**0.5
np.sqrt(a2)

array([[0.72561923, 0.37724327, 0.27507451, 0.77324539, 0.83758234,
        0.72155413, 0.0915412 , 0.34535743, 0.37359585, 0.49593014],
       [0.80382137, 0.39379349, 0.95137395, 0.89886023, 0.73224628,
        0.74738431, 0.61256829, 0.37108669, 0.47653194, 0.89403391],
       [0.57279376, 0.28464309, 0.89536601, 0.48959197, 0.58243462,
        0.77713866, 0.25654   , 0.66243424, 0.37307231, 0.905494  ],
       [0.73630172, 0.35953751, 0.82061954, 0.98700288, 0.18704758,
        0.9360045 , 0.49762418, 0.15103245, 0.7538657 , 0.71752489],
       [0.81845343, 0.34685412, 0.6578445 , 0.22083975, 0.29463107,
        0.88177546, 0.6527672 , 0.76758383, 0.49728789, 0.76866749],
       [0.92669648, 0.9455    , 0.98250275, 0.77444853, 0.09163494,
        0.96489303, 0.93656375, 0.89233029, 0.91569914, 0.3170642 ],
       [0.81958494, 0.59498997, 0.86321729, 0.40623752, 0.55607207,
        0.90168901, 0.27343189, 0.29470672, 0.56492986, 0.48718808],
       [0.69129614, 0.86449263, 0.51269643, 0.77126339, 0.8459234 ,
        0.7711983 , 0.68666523, 0.1858149 , 0.85048384, 0.6857753 ],
       [0.16963928, 0.91147641, 0.5593672 , 0.55138033, 0.35246218,
        0.75775339, 0.67487528, 0.90621598, 0.87282079, 0.32933873],
       [0.98023725, 0.91593824, 0.91996165, 0.51868386, 0.38141107,
        0.87114358, 0.53550697, 0.60742814, 0.61492234, 0.88341117]])

Built-in functions#

Operate over entire array, specified axes, or slice
Very fast/efficient

a2.mean()

0.46351243272781467

a2.std()

0.2922564296409726

a2.min()

0.008379791373509304

a2

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

Note on axis order#

When indexing, the first axis (0) selects rows and the second axis (1) selects columns
When aggregating (e.g., computing the mean along an axis), you are specifying the dimension of the array that will be collapsed, not the dimension that will be returned
- So axis=0 will aggregate values across all rows for each column in a 2D array
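A small non-square array makes the axis rule easier to see, because the output shape tells you which dimension was collapsed. A sketch with a made-up 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3): 2 rows, 3 columns

col_means = m.mean(axis=0)       # collapse the rows -> one value per column
row_means = m.mean(axis=1)       # collapse the columns -> one value per row

print(col_means, col_means.shape)   # [2.5 3.5 4.5] (3,)
print(row_means, row_means.shape)   # [2. 5.] (2,)
```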

a2[0:3,0:3]

array([[0.52652326, 0.14231249, 0.07566598],
       [0.6461288 , 0.15507331, 0.90511239],
       [0.32809269, 0.08102169, 0.80168028]])

a2[0:3,0:3].min(axis=0)

array([0.32809269, 0.08102169, 0.07566598])

a2[0:3,0:3].min(axis=1)

array([0.07566598, 0.15507331, 0.08102169])

Basic array plotting and visualization#

a

array([3, 9, 4, 6, 5, 4, 4, 3, 9, 1])

plt.plot(a)

[<matplotlib.lines.Line2D at 0x7f7d7cd5f160>]

../../_images/03_NumPy_Pandas_Demo_64_1.png

a2

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

plt.plot(a2)

[<matplotlib.lines.Line2D at 0x7f7d7cbd3f10>,
 <matplotlib.lines.Line2D at 0x7f7d7cbd3f70>,
 <matplotlib.lines.Line2D at 0x7f7d7cb610d0>,
 <matplotlib.lines.Line2D at 0x7f7d7cb611f0>,
 <matplotlib.lines.Line2D at 0x7f7d7cb61310>,
 <matplotlib.lines.Line2D at 0x7f7d7cb61430>,
 <matplotlib.lines.Line2D at 0x7f7d7cb61550>,
 <matplotlib.lines.Line2D at 0x7f7d7cb61670>,
 <matplotlib.lines.Line2D at 0x7f7d7cb61790>,
 <matplotlib.lines.Line2D at 0x7f7d7cb618b0>]

../../_images/03_NumPy_Pandas_Demo_66_1.png

plt.plot(a2[0])

[<matplotlib.lines.Line2D at 0x7f7d7cb5dd60>]

../../_images/03_NumPy_Pandas_Demo_67_1.png

#2D array visualization
plt.imshow(a2, cmap='gray')
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0x7f7d7c27b4f0>

../../_images/03_NumPy_Pandas_Demo_68_1.png

plt.hist(a2.ravel(), bins='auto')

(array([18., 14., 11.,  9., 15.,  7., 17.,  9.]),
 array([0.00837979, 0.12910415, 0.24982851, 0.37055287, 0.49127724,
        0.6120016 , 0.73272596, 0.85345032, 0.97417468]),
 <BarContainer object of 8 artists>)
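The first two items returned by plt.hist are the bin counts and bin edges. If you only need those numbers and not the plot, np.histogram computes them directly. A sketch using its own random array (arbitrary seed), so the counts will differ from the output above:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
data = rng.random((10, 10))

# ravel() flattens the 2D array to 1D before binning
counts, edges = np.histogram(data.ravel(), bins='auto')

print(counts.sum())               # 100: every element falls in some bin
print(len(edges) - len(counts))   # 1: n bins are bounded by n + 1 edges
```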

../../_images/03_NumPy_Pandas_Demo_69_1.png

Boolean arrays and fancy indexing#

a2

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

a2 > 0.5

array([[ True, False, False,  True,  True,  True, False, False, False,
        False],
       [ True, False,  True,  True,  True,  True, False, False, False,
         True],
       [False, False,  True, False, False,  True, False, False, False,
         True],
       [ True, False,  True,  True, False,  True, False, False,  True,
         True],
       [ True, False, False, False, False,  True, False,  True, False,
         True],
       [ True,  True,  True,  True, False,  True,  True,  True,  True,
        False],
       [ True, False,  True, False, False,  True, False, False, False,
        False],
       [False,  True, False,  True,  True,  True, False, False,  True,
        False],
       [False,  True, False, False, False,  True, False,  True,  True,
        False],
       [ True,  True,  True, False, False,  True, False, False, False,
         True]])

idx = (a2 > 0.5)

idx

array([[ True, False, False,  True,  True,  True, False, False, False,
        False],
       [ True, False,  True,  True,  True,  True, False, False, False,
         True],
       [False, False,  True, False, False,  True, False, False, False,
         True],
       [ True, False,  True,  True, False,  True, False, False,  True,
         True],
       [ True, False, False, False, False,  True, False,  True, False,
         True],
       [ True,  True,  True,  True, False,  True,  True,  True,  True,
        False],
       [ True, False,  True, False, False,  True, False, False, False,
        False],
       [False,  True, False,  True,  True,  True, False, False,  True,
        False],
       [False,  True, False, False, False,  True, False,  True,  True,
        False],
       [ True,  True,  True, False, False,  True, False, False, False,
         True]])

# Quick visualization, True = yellow (1)
plt.imshow(idx)
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0x7f7d7c13b910>

../../_images/03_NumPy_Pandas_Demo_75_1.png

# Return only elements where condition is True
a2[idx]

array([0.52652326, 0.59790843, 0.70154418, 0.52064036, 0.6461288 ,
       0.90511239, 0.80794971, 0.53618462, 0.5585833 , 0.79929664,
       0.80168028, 0.6039445 , 0.81991938, 0.54214022, 0.67341643,
       0.97417468, 0.87610442, 0.5683135 , 0.51484197, 0.66986602,
       0.77752797, 0.58918494, 0.59084971, 0.85876637, 0.89397025,
       0.96531165, 0.59977053, 0.93101856, 0.87715167, 0.79625335,
       0.83850491, 0.67171947, 0.74514409, 0.81304308, 0.74734751,
       0.59484722, 0.7155864 , 0.59474681, 0.72332275, 0.83078924,
       0.5741902 , 0.8212274 , 0.76181614, 0.96086507, 0.83894286,
       0.84632944, 0.75889114, 0.78041529])

# Original shape
a2.shape

(10, 10)

# Selected shape
a2[idx].shape
#idx.nonzero()[0].size

(48,)

#Can also be used for assignment
#a2[idx] = 0
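Assignment through a boolean index modifies the array in place, which is why the line above is commented out. A minimal sketch with a small made-up array, working on a copy, plus np.where as an alternative that returns a new array:

```python
import numpy as np

x = np.array([0.2, 0.6, 0.9, 0.4])
mask = x > 0.5

y = x.copy()               # work on a copy so x is left unchanged
y[mask] = 0                # in-place assignment where mask is True
print(y)                   # [0.2 0.  0.  0.4]

z = np.where(mask, 0, x)   # same result, returned as a new array
print(np.array_equal(y, z))   # True
```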

Bitwise operators, combining boolean arrays#

(a2 > 0.5)

array([[ True, False, False,  True,  True,  True, False, False, False,
        False],
       [ True, False,  True,  True,  True,  True, False, False, False,
         True],
       [False, False,  True, False, False,  True, False, False, False,
         True],
       [ True, False,  True,  True, False,  True, False, False,  True,
         True],
       [ True, False, False, False, False,  True, False,  True, False,
         True],
       [ True,  True,  True,  True, False,  True,  True,  True,  True,
        False],
       [ True, False,  True, False, False,  True, False, False, False,
        False],
       [False,  True, False,  True,  True,  True, False, False,  True,
        False],
       [False,  True, False, False, False,  True, False,  True,  True,
        False],
       [ True,  True,  True, False, False,  True, False, False, False,
         True]])

(a2 < 0.7)

array([[ True,  True,  True,  True, False,  True,  True,  True,  True,
         True],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
        False],
       [ True,  True, False,  True,  True,  True,  True,  True,  True,
        False],
       [ True,  True,  True, False,  True, False,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True, False,  True,  True,  True,
         True],
       [False, False, False,  True,  True, False, False, False, False,
         True],
       [ True,  True, False,  True,  True, False,  True,  True,  True,
         True],
       [ True, False,  True,  True, False,  True,  True,  True, False,
         True],
       [ True, False,  True,  True,  True,  True,  True, False, False,
         True],
       [False, False, False,  True,  True, False,  True,  True,  True,
        False]])

#Bitwise and - True if both are True
idx = (a2 > 0.5) & (a2 < 0.7)
#Bitwise or - True if either is True
#idx = (a2 < 0.5) | (a2 > 0.9)

idx

array([[ True, False, False,  True, False,  True, False, False, False,
        False],
       [ True, False, False, False,  True,  True, False, False, False,
        False],
       [False, False, False, False, False,  True, False, False, False,
        False],
       [ True, False,  True, False, False, False, False, False,  True,
         True],
       [ True, False, False, False, False, False, False,  True, False,
         True],
       [False, False, False,  True, False, False, False, False, False,
        False],
       [ True, False, False, False, False, False, False, False, False,
        False],
       [False, False, False,  True, False,  True, False, False, False,
        False],
       [False, False, False, False, False,  True, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False]])
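Two details are easy to trip over when combining conditions. The parentheses around each comparison are required, because & binds more tightly than > and < in Python. And the keyword `and` does not work element-wise on arrays. A short sketch with a made-up array:

```python
import numpy as np

x = np.array([0.2, 0.6, 0.9])

both = (x > 0.5) & (x < 0.7)               # element-wise AND
same = np.logical_and(x > 0.5, x < 0.7)    # equivalent function form
print(both)                                # [False  True False]

# Python's `and` needs a single truth value for the whole array,
# which is ambiguous for more than one element, so NumPy raises ValueError
try:
    (x > 0.5) and (x < 0.7)
    raised = False
except ValueError:
    raised = True
print(raised)                              # True
```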

plt.imshow(idx)

<matplotlib.image.AxesImage at 0x7f7d7c0c5700>

../../_images/03_NumPy_Pandas_Demo_85_1.png

a2[idx]

array([0.52652326, 0.59790843, 0.52064036, 0.6461288 , 0.53618462,
       0.5585833 , 0.6039445 , 0.54214022, 0.67341643, 0.5683135 ,
       0.51484197, 0.66986602, 0.58918494, 0.59084971, 0.59977053,
       0.67171947, 0.59484722, 0.59474681, 0.5741902 ])

a2[idx].shape

(19,)

#Invert the boolean array
~idx

array([[False,  True,  True, False,  True, False,  True,  True,  True,
         True],
       [False,  True,  True,  True, False, False,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True, False,  True,  True,  True,
         True],
       [False,  True, False,  True,  True,  True,  True,  True, False,
        False],
       [False,  True,  True,  True,  True,  True,  True, False,  True,
        False],
       [ True,  True,  True, False,  True,  True,  True,  True,  True,
         True],
       [False,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True, False,  True, False,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True, False,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]])

plt.imshow(idx)

<matplotlib.image.AxesImage at 0x7f7d7c0b99a0>

../../_images/03_NumPy_Pandas_Demo_89_1.png

plt.imshow(~idx)

<matplotlib.image.AxesImage at 0x7f7d7c0fdbe0>

../../_images/03_NumPy_Pandas_Demo_90_1.png

a2[~idx].shape

(81,)

Masked array#

a2

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

idx

array([[ True, False, False,  True, False,  True, False, False, False,
        False],
       [ True, False, False, False,  True,  True, False, False, False,
        False],
       [False, False, False, False, False,  True, False, False, False,
        False],
       [ True, False,  True, False, False, False, False, False,  True,
         True],
       [ True, False, False, False, False, False, False,  True, False,
         True],
       [False, False, False,  True, False, False, False, False, False,
        False],
       [ True, False, False, False, False, False, False, False, False,
        False],
       [False, False, False,  True, False,  True, False, False, False,
        False],
       [False, False, False, False, False,  True, False, False, False,
        False],
       [False, False, False, False, False, False, False, False, False,
        False]])

#np.ma.array

ma = np.ma.array(a2, mask=~idx)

ma

masked_array(
  data=[[0.5265232608453788, --, --, 0.5979084337784969, --,
         0.5206403629685025, --, --, --, --],
        [0.6461287968711238, --, --, --, 0.5361846154058071,
         0.558583303937883, --, --, --, --],
        [--, --, --, --, --, 0.6039444969658847, --, --, --, --],
        [0.5421402189101205, --, 0.6734164335026674, --, --, --, --, --,
         0.5683134980882404, 0.5148419689595748],
        [0.6698660228152108, --, --, --, --, --, --, 0.5891849393173709,
         --, 0.5908497065541548],
        [--, --, --, 0.5997705271856745, --, --, --, --, --, --],
        [0.6717194703376138, --, --, --, --, --, --, --, --, --],
        [--, --, --, 0.5948472205118084, --, 0.594746811681392, --, --,
         --, --],
        [--, --, --, --, --, 0.5741901997787556, --, --, --, --],
        [--, --, --, --, --, --, --, --, --, --]],
  mask=[[False,  True,  True, False,  True, False,  True,  True,  True,
          True],
        [False,  True,  True,  True, False, False,  True,  True,  True,
          True],
        [ True,  True,  True,  True,  True, False,  True,  True,  True,
          True],
        [False,  True, False,  True,  True,  True,  True,  True, False,
         False],
        [False,  True,  True,  True,  True,  True,  True, False,  True,
         False],
        [ True,  True,  True, False,  True,  True,  True,  True,  True,
          True],
        [False,  True,  True,  True,  True,  True,  True,  True,  True,
          True],
        [ True,  True,  True, False,  True, False,  True,  True,  True,
          True],
        [ True,  True,  True,  True,  True, False,  True,  True,  True,
          True],
        [ True,  True,  True,  True,  True,  True,  True,  True,  True,
          True]],
  fill_value=1e+20)

a2.mean()

0.46351243272781467

ma.mean()

0.5880947520218769
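The masked mean differs from a2.mean() because masked elements are excluded from the calculation. A minimal sketch on a small made-up array, using count() and compressed() to inspect what remains unmasked:

```python
import numpy as np

x = np.array([0.2, 0.6, 0.9, 0.55])
keep = (x > 0.5) & (x < 0.7)

# In a masked array, mask=True means "ignore this element",
# so we pass the inverse of the elements we want to keep
mx = np.ma.array(x, mask=~keep)

print(mx.count())        # 2 unmasked values
print(mx.compressed())   # [0.6  0.55], unmasked values as a plain 1D array

# The masked mean matches the mean of the boolean-indexed values
print(np.isclose(mx.mean(), x[keep].mean()))   # True
```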

plt.imshow(ma)

<matplotlib.image.AxesImage at 0x7f7d7c1bceb0>

../../_images/03_NumPy_Pandas_Demo_100_1.png

Masked array vs. np.nan#

We will come back to this in Lab05
Useful for representing nodata in raster datasets
Useful for dynamically masking outliers for calculations without removing values or creating new arrays
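np.nan is a floating-point value (float32 or float64), so an integer array must be cast to float to hold it. If your original array is int8, adding a boolean mask (stored by NumPy as 1 byte per element) is more memory-efficient than casting everything to float32 (4 bytes per element)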
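The two approaches can be compared side by side. In this sketch the int8 array and the nodata value of -128 are hypothetical, chosen only for illustration. Both approaches give the same mean, but the memory footprint differs:

```python
import numpy as np

nodata = -128   # hypothetical nodata value for this example
raw = np.array([10, nodata, 30, 40], dtype='int8')

# Option 1: masked array, keeps the int8 data and adds a boolean mask
masked = np.ma.masked_equal(raw, nodata)

# Option 2: cast to float so the nodata elements can be set to np.nan
as_float = raw.astype('float32')
as_float[raw == nodata] = np.nan

print(masked.mean(), np.nanmean(as_float))   # both approximately 26.67

# Bytes used: int8 data + boolean mask vs. float32 data
print(raw.nbytes + masked.mask.nbytes, as_float.nbytes)   # 8 16
```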

Pandas!#

#pd.DataFrame?

df = pd.DataFrame(a2)

df

	0	1	2	3	4	5	6	7	8	9
0	0.526523	0.142312	0.075666	0.597908	0.701544	0.520640	0.008380	0.119272	0.139574	0.245947
1	0.646129	0.155073	0.905112	0.807950	0.536185	0.558583	0.375240	0.137705	0.227083	0.799297
2	0.328093	0.081022	0.801680	0.239700	0.339230	0.603944	0.065813	0.438819	0.139183	0.819919
3	0.542140	0.129267	0.673416	0.974175	0.034987	0.876104	0.247630	0.022811	0.568313	0.514842
4	0.669866	0.120308	0.432759	0.048770	0.086807	0.777528	0.426105	0.589185	0.247295	0.590850
5	0.858766	0.893970	0.965312	0.599771	0.008397	0.931019	0.877152	0.796253	0.838505	0.100530
6	0.671719	0.354013	0.745144	0.165029	0.309216	0.813043	0.074765	0.086852	0.319146	0.237352
7	0.477890	0.747348	0.262858	0.594847	0.715586	0.594747	0.471509	0.034527	0.723323	0.470288
8	0.028777	0.830789	0.312892	0.304020	0.124230	0.574190	0.455457	0.821227	0.761816	0.108464
9	0.960865	0.838943	0.846329	0.269033	0.145474	0.758891	0.286768	0.368969	0.378129	0.780415

df.index = ['a','b','c','d','e','f','g','h','i','j']

df

	0	1	2	3	4	5	6	7	8	9
a	0.526523	0.142312	0.075666	0.597908	0.701544	0.520640	0.008380	0.119272	0.139574	0.245947
b	0.646129	0.155073	0.905112	0.807950	0.536185	0.558583	0.375240	0.137705	0.227083	0.799297
c	0.328093	0.081022	0.801680	0.239700	0.339230	0.603944	0.065813	0.438819	0.139183	0.819919
d	0.542140	0.129267	0.673416	0.974175	0.034987	0.876104	0.247630	0.022811	0.568313	0.514842
e	0.669866	0.120308	0.432759	0.048770	0.086807	0.777528	0.426105	0.589185	0.247295	0.590850
f	0.858766	0.893970	0.965312	0.599771	0.008397	0.931019	0.877152	0.796253	0.838505	0.100530
g	0.671719	0.354013	0.745144	0.165029	0.309216	0.813043	0.074765	0.086852	0.319146	0.237352
h	0.477890	0.747348	0.262858	0.594847	0.715586	0.594747	0.471509	0.034527	0.723323	0.470288
i	0.028777	0.830789	0.312892	0.304020	0.124230	0.574190	0.455457	0.821227	0.761816	0.108464
j	0.960865	0.838943	0.846329	0.269033	0.145474	0.758891	0.286768	0.368969	0.378129	0.780415

# Still just a NumPy array under the hood
df.values

array([[0.52652326, 0.14231249, 0.07566598, 0.59790843, 0.70154418,
        0.52064036, 0.00837979, 0.11927175, 0.13957386, 0.2459467 ],
       [0.6461288 , 0.15507331, 0.90511239, 0.80794971, 0.53618462,
        0.5585833 , 0.37523991, 0.13770533, 0.22708269, 0.79929664],
       [0.32809269, 0.08102169, 0.80168028, 0.2397003 , 0.33923008,
        0.6039445 , 0.06581277, 0.43881912, 0.13918295, 0.81991938],
       [0.54214022, 0.12926722, 0.67341643, 0.97417468, 0.0349868 ,
        0.87610442, 0.24762983, 0.0228108 , 0.5683135 , 0.51484197],
       [0.66986602, 0.12030778, 0.43275939, 0.0487702 , 0.08680747,
        0.77752797, 0.42610501, 0.58918494, 0.24729525, 0.59084971],
       [0.85876637, 0.89397025, 0.96531165, 0.59977053, 0.00839696,
        0.93101856, 0.87715167, 0.79625335, 0.83850491, 0.10052971],
       [0.67171947, 0.35401307, 0.74514409, 0.16502892, 0.30921615,
        0.81304308, 0.074765  , 0.08685205, 0.31914575, 0.23735222],
       [0.47789035, 0.74734751, 0.26285763, 0.59484722, 0.7155864 ,
        0.59474681, 0.47150914, 0.03452718, 0.72332275, 0.47028776],
       [0.02877749, 0.83078924, 0.31289166, 0.30402027, 0.12422959,
        0.5741902 , 0.45545665, 0.8212274 , 0.76181614, 0.108464  ],
       [0.96086507, 0.83894286, 0.84632944, 0.26903294, 0.14547441,
        0.75889114, 0.28676771, 0.36896895, 0.37812949, 0.78041529]])

df.index.values

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype=object)

# Mean of each column
df.mean()

0    0.571077
1    0.429305
2    0.602117
3    0.460120
4    0.300166
5    0.700869
6    0.328882
7    0.341562
8    0.434237
9    0.466790
dtype: float64

# Mean of each row
df.mean(axis=1)

a    0.307777
b    0.514836
c    0.385740
d    0.458369
e    0.398947
f    0.686967
g    0.377628
h    0.509292
i    0.432186
j    0.563382
dtype: float64
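To make the axis convention easier to check by hand, here is a small sketch with a made-up 2x3 array (the names `arr` and `small_df` are illustrative, not part of the demo above).

```python
import numpy as np
import pandas as pd

# Small made-up array so the results can be verified by hand
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
small_df = pd.DataFrame(arr, index=['a', 'b'])

col_means = small_df.mean()        # axis=0 (default): one value per column
row_means = small_df.mean(axis=1)  # axis=1: one value per row

print(col_means.tolist())  # [2.5, 3.5, 4.5]
print(row_means.tolist())  # [2.0, 5.0]

# Same reductions on the underlying NumPy array
print(np.allclose(col_means.values, arr.mean(axis=0)))  # True
print(np.allclose(row_means.values, arr.mean(axis=1)))  # True
```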

Reading files with Pandas#

Most of the time, you will read in tabular data and let Pandas do the work.

# Relative path to csv from Lab01
csv_fn = '../01_Shell_Github/data/GLAH14_tllz_conus_lulcfilt_demfilt.csv'

# Quick check using shell `head` command
!head -n 5 $csv_fn

decyear,ordinal,lat,lon,glas_z,dem_z,dem_z_std,lulc
2003.13957078,731266.9433448168,44.157897,-105.356562,1398.51,1400.52,0.33,31
2003.13957081,731266.9433462636,44.150175,-105.358116,1387.11,1384.64,0.43,31
2003.13957081,731266.9433465529,44.148632,-105.358427,1392.83,1383.49,0.28,31
2003.13957081,731266.9433468423,44.147087,-105.358738,1384.24,1382.85,0.84,31

pd.read_csv(csv_fn)

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
0	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
1	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31
2	2003.139571	731266.943347	44.148632	-105.358427	1392.83	1383.49	0.28	31
3	2003.139571	731266.943347	44.147087	-105.358738	1384.24	1382.85	0.84	31
4	2003.139571	731266.943347	44.145542	-105.359048	1369.21	1380.24	1.73	31
...	...	...	...	...	...	...	...	...
65231	2009.775995	733691.238340	37.896222	-117.044399	1556.16	1556.43	0.00	31
65232	2009.775995	733691.238340	37.897769	-117.044675	1556.02	1556.43	0.00	31
65233	2009.775995	733691.238340	37.899319	-117.044952	1556.19	1556.44	0.00	31
65234	2009.775995	733691.238340	37.900869	-117.045230	1556.18	1556.44	0.00	31
65235	2009.775995	733691.238341	37.902420	-117.045508	1556.32	1556.44	0.00	31

65236 rows × 8 columns

# Store output as a new Pandas DataFrame
glas_df = pd.read_csv(csv_fn)
glas_df

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
0	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
1	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31
2	2003.139571	731266.943347	44.148632	-105.358427	1392.83	1383.49	0.28	31
3	2003.139571	731266.943347	44.147087	-105.358738	1384.24	1382.85	0.84	31
4	2003.139571	731266.943347	44.145542	-105.359048	1369.21	1380.24	1.73	31
...	...	...	...	...	...	...	...	...
65231	2009.775995	733691.238340	37.896222	-117.044399	1556.16	1556.43	0.00	31
65232	2009.775995	733691.238340	37.897769	-117.044675	1556.02	1556.43	0.00	31
65233	2009.775995	733691.238340	37.899319	-117.044952	1556.19	1556.44	0.00	31
65234	2009.775995	733691.238340	37.900869	-117.045230	1556.18	1556.44	0.00	31
65235	2009.775995	733691.238341	37.902420	-117.045508	1556.32	1556.44	0.00	31

65236 rows × 8 columns

type(glas_df)

pandas.core.frame.DataFrame
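If the csv file is not at hand, the same parsing can be tried on an in-memory string. This is a sketch, assuming only the header and the first two rows as displayed in the table above (with decyear and ordinal rounded as shown). The `usecols` argument and the name `sample_df` are additions for illustration.

```python
import io
import pandas as pd

# Header plus the first two rows, copied from the displayed table (rounded values)
csv_text = """decyear,ordinal,lat,lon,glas_z,dem_z,dem_z_std,lulc
2003.139571,731266.943345,44.157897,-105.356562,1398.51,1400.52,0.33,31
2003.139571,731266.943346,44.150175,-105.358116,1387.11,1384.64,0.43,31
"""

# read_csv accepts any file-like object; usecols limits which columns are loaded
sample_df = pd.read_csv(io.StringIO(csv_text),
                        usecols=['lat', 'lon', 'glas_z', 'lulc'])

# Pandas infers a dtype for each column (float for coordinates, int for lulc)
print(sample_df.dtypes)
print(sample_df.shape)  # (2, 4)
```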

# For demonstration purposes - rescale the index so labels differ from positions, to illustrate the difference between loc and iloc
glas_df.set_index(glas_df.index*10+1, inplace=True)
glas_df

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
1	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
11	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31
21	2003.139571	731266.943347	44.148632	-105.358427	1392.83	1383.49	0.28	31
31	2003.139571	731266.943347	44.147087	-105.358738	1384.24	1382.85	0.84	31
41	2003.139571	731266.943347	44.145542	-105.359048	1369.21	1380.24	1.73	31
...	...	...	...	...	...	...	...	...
652311	2009.775995	733691.238340	37.896222	-117.044399	1556.16	1556.43	0.00	31
652321	2009.775995	733691.238340	37.897769	-117.044675	1556.02	1556.43	0.00	31
652331	2009.775995	733691.238340	37.899319	-117.044952	1556.19	1556.44	0.00	31
652341	2009.775995	733691.238340	37.900869	-117.045230	1556.18	1556.44	0.00	31
652351	2009.775995	733691.238341	37.902420	-117.045508	1556.32	1556.44	0.00	31

65236 rows × 8 columns

# Awesome descriptive statistics for each column
glas_df.describe()

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
count	65236.000000	65236.000000	65236.000000	65236.000000	65236.000000	65236.000000	65236.000000	65236.000000
mean	2005.945322	732291.890372	40.946798	-115.040612	1791.494167	1792.260964	5.504748	30.339444
std	1.729573	631.766682	3.590476	5.465065	1037.183482	1037.925371	7.518558	3.480576
min	2003.139571	731266.943345	34.999455	-124.482406	-115.550000	-114.570000	0.000000	12.000000
25%	2004.444817	731743.803182	38.101451	-119.257599	1166.970000	1168.240000	0.070000	31.000000
50%	2005.846896	732256.116938	39.884541	-115.686241	1555.730000	1556.380000	1.350000	31.000000
75%	2007.223249	732758.486046	43.453565	-109.816475	2399.355000	2400.072500	9.530000	31.000000
max	2009.775995	733691.238341	48.999727	-104.052336	4340.310000	4252.940000	49.900000	31.000000

Indexing and selecting#

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing

# Integer indexing like NumPy
glas_df.iloc[2]

decyear        2003.139571
ordinal      731266.943347
lat              44.148632
lon            -105.358427
glas_z         1392.830000
dem_z          1383.490000
dem_z_std         0.280000
lulc             31.000000
Name: 21, dtype: float64

glas_df.iloc[0:3]

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
1	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
11	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31
21	2003.139571	731266.943347	44.148632	-105.358427	1392.83	1383.49	0.28	31

glas_df.loc[21]

decyear        2003.139571
ordinal      731266.943347
lat              44.148632
lon            -105.358427
glas_z         1392.830000
dem_z          1383.490000
dem_z_std         0.280000
lulc             31.000000
Name: 21, dtype: float64

# Label-based slice: rows whose index labels fall between 0 and 20 (both endpoints inclusive)
glas_df.loc[0:20]

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
1	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
11	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31

# Position-based slice: rows at integer positions 0 through 19 (end excluded)
glas_df.iloc[0:20]

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
1	2003.139571	731266.943345	44.157897	-105.356562	1398.51	1400.52	0.33	31
11	2003.139571	731266.943346	44.150175	-105.358116	1387.11	1384.64	0.43	31
21	2003.139571	731266.943347	44.148632	-105.358427	1392.83	1383.49	0.28	31
31	2003.139571	731266.943347	44.147087	-105.358738	1384.24	1382.85	0.84	31
41	2003.139571	731266.943347	44.145542	-105.359048	1369.21	1380.24	1.73	31
51	2003.139571	731266.943347	44.143996	-105.359359	1366.60	1375.23	1.60	31
61	2003.139571	731266.943351	44.126969	-105.362876	1355.14	1379.38	2.17	31
71	2003.139571	731266.943360	44.074358	-105.373549	1369.53	1391.71	2.88	31
81	2003.139571	731266.943361	44.072806	-105.373864	1380.02	1387.79	0.45	31
91	2003.139571	731266.943361	44.071256	-105.374177	1391.47	1396.90	1.56	31
101	2003.139571	731266.943362	44.063515	-105.375712	1388.58	1408.54	0.24	31
111	2003.139571	731266.943363	44.061967	-105.376015	1372.55	1406.21	0.17	31
121	2003.139571	731266.943364	44.057328	-105.376934	1402.38	1406.23	0.33	31
131	2003.139571	731266.943364	44.055780	-105.377243	1401.82	1405.75	0.35	31
141	2003.139571	731266.943364	44.054231	-105.377553	1399.31	1406.05	0.68	31
151	2003.139571	731266.943366	44.046487	-105.379115	1394.22	1398.14	0.27	31
161	2003.139571	731266.943366	44.044941	-105.379430	1394.94	1400.58	0.17	31
171	2003.139571	731266.943367	44.041850	-105.380064	1386.00	1389.69	0.57	31
181	2003.139571	731266.943424	43.737000	-105.441568	1496.53	1498.16	1.52	31
191	2003.139571	731266.943429	43.706060	-105.447754	1459.99	1460.90	0.08	31
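The two slices above return different numbers of rows because `loc` matches index labels and includes both endpoints, while `iloc` counts positions and excludes the end. A small sketch with a made-up frame (the name `toy` and its values are illustrative) shows the same behavior:

```python
import pandas as pd

# Made-up frame with a non-default integer index, in the spirit of glas_df above
toy = pd.DataFrame({'z': [10.0, 11.0, 12.0, 13.0]}, index=[1, 11, 21, 31])

by_label = toy.loc[0:20]     # labels between 0 and 20, endpoints included
by_position = toy.iloc[0:2]  # positions 0 and 1, end excluded

print(by_label.index.tolist())       # [1, 11]
print(by_position.index.tolist())    # [1, 11]

# A label slice includes the end label when it exists in the index
print(toy.loc[1:21].index.tolist())  # [1, 11, 21]

# Position 2 holds the row labeled 21
print(toy.iloc[2].name)              # 21
```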

Selecting columns#

glas_df.columns

Index(['decyear', 'ordinal', 'lat', 'lon', 'glas_z', 'dem_z', 'dem_z_std',
       'lulc'],
      dtype='object')

glas_df['glas_z']

1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64

glas_df.glas_z

1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64

glas_df.iloc[:,4]

1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64

glas_df.loc[:,'glas_z']

1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64

#Multiple columns
glas_df['glas_z', 'dem_z']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/srv/conda/envs/notebook/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/srv/conda/envs/notebook/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('glas_z', 'dem_z')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_512/2880611682.py in <module>
      1 #Multiple columns
----> 2 glas_df['glas_z', 'dem_z']

/srv/conda/envs/notebook/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

/srv/conda/envs/notebook/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: ('glas_z', 'dem_z')

# Need to pass in a list of column names
glas_df[['glas_z', 'dem_z']]

	glas_z	dem_z
1	1398.51	1400.52
11	1387.11	1384.64
21	1392.83	1383.49
31	1384.24	1382.85
41	1369.21	1380.24
...	...	...
652311	1556.16	1556.43
652321	1556.02	1556.43
652331	1556.19	1556.44
652341	1556.18	1556.44
652351	1556.32	1556.44

65236 rows × 2 columns

glas_df.loc[:,['glas_z', 'dem_z']]

	glas_z	dem_z
1	1398.51	1400.52
11	1387.11	1384.64
21	1392.83	1383.49
31	1384.24	1382.85
41	1369.21	1380.24
...	...	...
652311	1556.16	1556.43
652321	1556.02	1556.43
652331	1556.19	1556.44
652341	1556.18	1556.44
652351	1556.32	1556.44

65236 rows × 2 columns
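The bracket style determines the type of the result. The sketch below uses a two-row frame built from the first two rows shown above (the name `toy` is illustrative) to compare single brackets, double brackets, and the bare-tuple form that raised the KeyError.

```python
import pandas as pd

toy = pd.DataFrame({'glas_z': [1398.51, 1387.11],
                    'dem_z': [1400.52, 1384.64]})

single = toy['glas_z']           # single label -> Series
one_col = toy[['glas_z']]        # list with one label -> DataFrame
both = toy[['glas_z', 'dem_z']]  # list of labels -> DataFrame

print(type(single).__name__)   # Series
print(type(one_col).__name__)  # DataFrame
print(both.shape)              # (2, 2)

# Without the inner list, the labels are treated as one tuple key
try:
    toy['glas_z', 'dem_z']
except KeyError as err:
    print('KeyError:', err)
```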

Boolean indexing#

glas_df['lulc']

1         31
11        31
21        31
31        31
41        31
          ..
652311    31
652321    31
652331    31
652341    31
652351    31
Name: lulc, Length: 65236, dtype: int64

glas_df['lulc'].value_counts()

31    62968
12     2268
Name: lulc, dtype: int64

glas_df['lulc'] == 12

1         False
11        False
21        False
31        False
41        False
          ...  
652311    False
652321    False
652331    False
652341    False
652351    False
Name: lulc, Length: 65236, dtype: bool

# Boolean Series (index and single column) will be True for records with 'lulc' == 12
idx2 = glas_df['lulc'] == 12

type(idx2)

pandas.core.series.Series

idx2.shape

(65236,)

glas_df.shape

(65236, 8)

# Use to select corresponding rows, returns a new DataFrame with all columns
glas_df[idx2]

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std	lulc
231	2003.139573	731266.944184	39.669291	-106.225142	3505.12	3508.25	5.74	12
301	2003.139573	731266.944316	38.961190	-106.355153	4046.47	4047.25	7.14	12
4891	2003.147846	731269.963718	48.587233	-113.484046	2135.76	2123.37	1.18	12
4921	2003.147846	731269.963811	48.091352	-113.595790	1632.52	1615.77	11.43	12
7561	2003.157366	731273.438572	43.897412	-114.457131	2886.39	2889.82	20.31	12
...	...	...	...	...	...	...	...	...
647241	2009.764964	733687.211708	40.689722	-105.918309	3267.33	3267.62	1.83	12
647251	2009.764964	733687.211709	40.694371	-105.919164	3235.77	3238.94	3.78	12
649831	2009.771998	733689.779258	47.910365	-123.628017	1671.86	1711.73	8.44	12
649841	2009.771998	733689.779258	47.908820	-123.628357	1737.70	1776.17	7.70	12
649851	2009.771998	733689.779258	47.907275	-123.628697	1782.52	1828.93	4.41	12

2268 rows × 8 columns

glas_df[idx2].shape

(2268, 8)

glas_df[idx2].mean()

decyear        2006.008627
ordinal      732315.035881
lat              43.065223
lon            -112.936499
glas_z         2918.746261
dem_z          2920.785754
dem_z_std         9.719951
lulc             12.000000
dtype: float64
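Boolean Series can also be combined before indexing. A minimal sketch, assuming a small made-up frame (the 2000 m threshold and the name `toy` are illustrative); note the element-wise `&` operator rather than Python's `and`:

```python
import pandas as pd

# Small illustrative frame with the same column names as glas_df
toy = pd.DataFrame({
    'glas_z': [3505.12, 1398.51, 2135.76, 1632.52],
    'lulc': [12, 31, 12, 12],
})

is_12 = toy['lulc'] == 12
is_high = toy['glas_z'] > 2000  # hypothetical elevation threshold

# Combine conditions element-wise with & (and), | (or), ~ (not)
subset = toy[is_12 & is_high]
print(subset.index.tolist())  # [0, 2]

# loc accepts a boolean row selector and a column label together
mean_z = toy.loc[is_12, 'glas_z'].mean()
print(mean_z)
```

When writing the comparisons inline, wrap each in parentheses, e.g. `toy[(toy['lulc'] == 12) & (toy['glas_z'] > 2000)]`, because `&` binds more tightly than `==` and `>`.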

Groupby#

Let’s consider statistics for groups of rows that share the same value in a given column.

glas_df.groupby('lulc')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f7d7c2b2a90>

glas_df.groupby('lulc').count()

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std
lulc
12	2268	2268	2268	2268	2268	2268	2268
31	62968	62968	62968	62968	62968	62968	62968

glas_df.groupby('lulc').mean()

	decyear	ordinal	lat	lon	glas_z	dem_z	dem_z_std
lulc
12	2006.008627	732315.035881	43.065223	-112.936499	2918.746261	2920.785754	9.719951
31	2005.943042	732291.056710	40.870496	-115.116398	1750.892469	1751.613426	5.352924

glas_df.groupby('lulc').agg(['mean', 'std'])

	decyear		ordinal		lat		lon		glas_z		dem_z		dem_z_std
	mean	std	mean	std	mean	std	mean	std	mean	std	mean	std	mean	std
lulc
12	2006.008627	1.498488	732315.035881	547.316709	43.065223	3.569772	-112.936499	7.610318	2918.746261	772.429857	2920.785754	769.897983	9.719951	5.805685
31	2005.943042	1.737290	732291.056710	634.586821	40.870496	3.567855	-115.116398	5.356521	1750.892469	1022.544938	1751.613426	1023.340882	5.352924	7.529161

import seaborn as sns
planets = sns.load_dataset('planets')

planets

	method	number	orbital_period	mass	distance	year
0	Radial Velocity	1	269.300000	7.10	77.40	2006
1	Radial Velocity	1	874.774000	2.21	56.95	2008
2	Radial Velocity	1	763.000000	2.60	19.84	2011
3	Radial Velocity	1	326.030000	19.40	110.62	2007
4	Radial Velocity	1	516.220000	10.50	119.47	2009
...	...	...	...	...	...	...
1030	Transit	1	3.941507	NaN	172.00	2006
1031	Transit	1	2.615864	NaN	148.00	2007
1032	Transit	1	3.191524	NaN	174.00	2007
1033	Transit	1	4.125083	NaN	293.00	2008
1034	Transit	1	4.187757	NaN	260.00	2008

1035 rows × 6 columns

planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
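The same split-apply-combine pattern works on a derived column. This sketch uses a made-up frame with the glas_df column names. The `diff_z` column and all values are assumptions for illustration, not results from the lab data.

```python
import pandas as pd

# Made-up values; column names follow glas_df
toy = pd.DataFrame({
    'lulc': [12, 31, 12, 31],
    'glas_z': [3500.0, 1400.0, 2120.0, 1596.0],
    'dem_z': [3510.0, 1390.0, 2100.0, 1600.0],
})

# Hypothetical derived column: elevation difference for each record
toy['diff_z'] = toy['glas_z'] - toy['dem_z']

# Select one column from the grouped object, then apply several aggregations
summary = toy.groupby('lulc')['diff_z'].agg(['count', 'mean', 'median'])
print(summary)

print(summary.loc[12, 'mean'])  # 5.0
print(summary.loc[31, 'mean'])  # 3.0
```

Selecting the column before aggregating avoids computing statistics for columns that are not needed.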
