Monday, December 28, 2020

NumPy Illustrated: The Visual Guide to NumPy

2. Matrices, the 2D Arrays

There used to be a dedicated matrix class in NumPy, but it is deprecated now, so I’ll use the words matrix and 2D array interchangeably.

Matrix initialization syntax is similar to that of vectors:
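A minimal sketch of the usual 2D constructors; the values and shapes are made up for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # from nested lists, shape (2, 3)
z = np.zeros((3, 2))           # the shape is passed as a single tuple
o = np.ones((3, 2))
f = np.full((3, 2), 7)         # every element set to 7
e = np.eye(3)                  # 3x3 identity matrix
u = np.empty((3, 2))           # uninitialized memory, arbitrary values
```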

Double parentheses are necessary here because the second positional parameter is reserved for the (optional) dtype (which also accepts integers).

Random matrix generation is also similar to that of vectors:
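A short sketch using the Generator interface (np.random.default_rng). The choice of this interface and the seed are assumptions of this example; the legacy np.random functions accept shapes in a similar way:

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # seed chosen arbitrarily

ints = rng.integers(0, 10, size=(3, 2))      # integers in [0, 10)
unif = rng.uniform(1, 10, size=(3, 2))       # floats in [1, 10)
norm = rng.normal(5, 2, size=(3, 2))         # mean 5, standard deviation 2
std = rng.standard_normal((3, 2))            # mean 0, standard deviation 1
```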

Two-dimensional indexing syntax is more convenient than that of nested lists:
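A few typical cases on a made-up 3x4 matrix, with the views marked in the comments:

```python
import numpy as np

a = np.arange(1, 13).reshape(3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]

elem = a[1, 2]         # single element (nested lists would need a[1][2])
row = a[1, :]          # second row, a view
col = a[:, 2]          # third column, a view
block = a[:, 1:3]      # columns 1 and 2 of every row, a view
every2 = a[-2:, ::2]   # last two rows, every second column, a view

a[1, 2] = 70           # modify the original array
# the views created above now show the new value, e.g. row is [5 6 70 8]
```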

The “view” sign means that no copying is actually done when slicing an array. When the array is modified, the changes are reflected in the slice as well.

The axis argument

In many operations (e.g., sum) you need to tell NumPy whether you want to operate across rows or columns. To have a universal notation that works for an arbitrary number of dimensions, NumPy introduces the notion of axis: the value of the axis argument is the number of the index in question. The first index is axis=0, the second one is axis=1, and so on. So in 2D, axis=0 is column-wise and axis=1 is row-wise.
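For instance, with sum on a small made-up matrix:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()            # all elements
col_sums = a.sum(axis=0)   # collapses the first index: one sum per column
row_sums = a.sum(axis=1)   # collapses the second index: one sum per row
```

Here col_sums has shape (3,) and row_sums has shape (2,): the axis you name is the one that disappears from the result.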

Matrix arithmetic

In addition to ordinary operators (like +,-,*,/,// and **) which work element-wise, there’s a @ operator that calculates a matrix product:
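A quick comparison of the element-wise operators and @ on two made-up 2x2 matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[2, 0],
              [0, 2]])

elementwise = a * b    # element-wise product
halves = a // 2        # element-wise floor division
squares = a ** 2       # element-wise power
product = a @ b        # matrix product
```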

As a generalization of broadcasting from scalar that we’ve seen already in the first part, NumPy allows mixed operations between a vector and a matrix, and even between two vectors:
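A sketch of the three mixed cases, with values picked so that the effect is easy to see:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
row = np.array([1, 10, 100])            # 1D array, shape (3,)
col = np.array([[1], [10], [100]])      # column vector, shape (3, 1)

a_row = a * row    # row is stretched down: column j is scaled by row[j]
a_col = a * col    # col is stretched right: row i is scaled by col[i, 0]
grid = row * col   # both are stretched to (3, 3), then multiplied element-wise
```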

Note that the last example is a symmetric per-element multiplication. To calculate the outer product using the asymmetric linear-algebra matrix multiplication, the order of the operands should be reversed:
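With @ the order of the operands matters. A small sketch with two made-up vectors, using np.outer as a cross-check:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
col = u.reshape(-1, 1)     # column vector, shape (3, 1)
row = v.reshape(1, -1)     # row vector, shape (1, 3)

outer = col @ row          # (3, 1) @ (1, 3) gives the (3, 3) outer product
inner = row @ col          # (1, 3) @ (3, 1) gives a (1, 1) inner product
check = np.outer(u, v)     # same values as outer
```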

Row vectors and column vectors

As seen from the example above, in the 2D context, row and column vectors are treated differently. This contrasts with the usual NumPy practice of having one type of 1D array wherever possible (e.g., a[:,j], the j-th column of a 2D array a, is a 1D array). By default, 1D arrays are treated as row vectors in 2D operations, so when multiplying a matrix by a row vector, you can use either shape (n,) or (1, n) and the result will be the same. If you need a column vector, there are a couple of ways to make one from a 1D array, but surprisingly transpose is not one of them:
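A small check of both claims (the arrays are made up):

```python
import numpy as np

a = np.array([1, 2, 3])          # 1D array, shape (3,)
t = a.T                          # transposing a 1D array changes nothing

row2d = np.array([[1, 2, 3]])    # 2D row vector, shape (1, 3)
col2d = row2d.T                  # a 2D array does transpose: shape (3, 1)

m = np.ones((2, 3))
same = np.array_equal(m * a, m * row2d)   # (n,) and (1, n) broadcast alike
```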

Two operations that are capable of making a 2D column vector out of a 1D array are reshaping and indexing with newaxis:
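Both variants, plus the way back to 1D, on a made-up array:

```python
import numpy as np

a = np.array([1, 2, 3])

c1 = a.reshape(-1, 1)      # column vector, shape (3, 1)
c2 = a[:, None]            # same result via indexing
c3 = a[:, np.newaxis]      # np.newaxis is the spelled-out form of None
r = a[None, :]             # 2D row vector, shape (1, 3)
flat = c1.reshape(-1)      # back to a 1D array, shape (3,)
```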

Here the -1 argument tells reshape to calculate one of the dimension sizes automatically and None in the square brackets serves as a shortcut for np.newaxis, which adds an empty axis at the designated place.

So, there’s a total of three types of vectors in NumPy: 1D arrays, 2D row vectors, and 2D column vectors. Here’s a diagram of explicit conversions between those:

By the rules of broadcasting, 1D arrays are implicitly interpreted as 2D row vectors, so it is generally not necessary to convert between those two — thus the corresponding area is shaded.

Matrix manipulations

There are two main functions for joining the arrays:
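The two functions are hstack and vstack; a minimal sketch with made-up matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

h = np.hstack((a, b))   # side by side, shape (2, 4)
v = np.vstack((a, b))   # one on top of the other, shape (4, 2)
```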

Those two work fine when stacking matrices only or vectors only, but when it comes to mixed stacking of 1D arrays and matrices, only vstack works as expected. hstack raises a dimension-mismatch error because, as described above, the 1D array is interpreted as a row vector, not a column vector. The workaround is either to convert it to a column vector or to use the specialized column_stack function, which does it automatically:
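A sketch of the mixed case with a made-up matrix and 1D array; the hstack_failed flag exists only to record the error for this example:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
v = np.array([5, 6])                      # 1D array

below = np.vstack((a, v))                 # v becomes a new row

try:
    np.hstack((a, v))                     # 2D and 1D inputs do not match
    hstack_failed = False
except ValueError:
    hstack_failed = True

right = np.hstack((a, v[:, None]))        # explicit column vector
right2 = np.column_stack((a, v))          # the conversion is done for you
```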

The inverse of stacking is splitting:
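A sketch with hsplit and vsplit on a made-up 4x4 matrix. The second argument is either a number of equal parts or a list of split positions:

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

left, right = np.hsplit(a, 2)     # two halves by columns, each of shape (4, 2)
top, bottom = np.vsplit(a, 2)     # two halves by rows, each of shape (2, 4)
parts = np.hsplit(a, [1, 3])      # columns [0:1], [1:3] and [3:]
```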

Matrix replication can be done in two ways: tile acts like copy-pasting the whole matrix, while repeat duplicates each element in place, like uncollated printing:
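The difference is easiest to see on a tiny made-up matrix:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

tiled = np.tile(a, (2, 3))       # the whole block repeated 2 times down, 3 across
rep_cols = a.repeat(3, axis=1)   # each column repeated 3 times in place
rep_rows = a.repeat(2, axis=0)   # each row repeated 2 times in place
```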

Specific columns and rows can be deleted like that:
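A sketch with np.delete on a made-up 3x5 matrix; np.s_ is a helper for writing a slice as an argument:

```python
import numpy as np

a = np.arange(1, 16).reshape(3, 5)
# [[ 1  2  3  4  5]
#  [ 6  7  8  9 10]
#  [11 12 13 14 15]]

no_row1 = np.delete(a, 1, axis=0)             # drop the second row
no_cols = np.delete(a, [1, 3], axis=1)        # drop columns 1 and 3
edges = np.delete(a, np.s_[1:-1], axis=1)     # drop all inner columns
```

np.delete returns a new array; a itself is left unchanged.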

The inverse operation is insert:
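A sketch of np.insert with made-up values. The index refers to positions in the original array, before which the new values are placed:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

zero_cols = np.insert(a, [1, 2], 0, axis=1)      # zero columns before columns 1 and 2
new_row = np.insert(a, 1, [7, 8, 9], axis=0)     # a row before row 1
new_col = np.insert(a, 1, [7, 8], axis=1)        # a column before column 1
```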

The append function, just like hstack, is unable to automatically transpose 1D arrays, so once again, either the vector needs to be reshaped or a dimension added, or column_stack needs to be used instead:
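A sketch of the append behavior on made-up data; the append_failed flag exists only to record the error for this example:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
v = np.array([7, 8])                          # 1D array, one value per row

try:
    np.append(a, v, axis=1)                   # 1D array is not turned into a column
    append_failed = False
except ValueError:
    append_failed = True

with_col = np.append(a, v[:, None], axis=1)   # explicit column vector works
with_row = np.append(a, [[7, 8, 9]], axis=0)  # a 2D row works for axis=0
flat = np.append(a, v)                        # without axis, everything is flattened
```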

Actually, if all you need to do is add constant values to the border(s) of the array, the (slightly overcomplicated) pad function should suffice:
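Two typical calls on a made-up matrix: a uniform border, and per-side widths given as ((before, after), ...) for each axis:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

framed = np.pad(a, 1)                                    # zeros, one cell on every side
tail = np.pad(a, ((0, 1), (0, 2)), constant_values=9)    # one row below, two columns right
```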

Meshgrids

The broadcasting rules make it simpler to work with meshgrids. Suppose you need the following matrix (but of a very large size):

Two obvious approaches are slow, as they use Python loops. The MATLAB way of dealing with such problems is to create a meshgrid:

The meshgrid function accepts an arbitrary set of indices, mgrid accepts only slices, and indices can only generate complete index ranges. fromfunction calls the provided function just once, with the I and J arguments as described above.
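A sketch of the four variants. The target matrix A[i, j] = i - j and the 3x4 size are assumptions made for this example, not necessarily the matrix from the original illustration:

```python
import numpy as np

i = np.arange(3)
j = np.arange(4)

I, J = np.meshgrid(i, j, indexing='ij')    # two full (3, 4) index matrices
A = I - J

I2, J2 = np.mgrid[0:3, 0:4]                # same grids, built from slices
I3, J3 = np.indices((3, 4))                # same grids, built from a shape
A4 = np.fromfunction(lambda I, J: I - J, (3, 4))   # one call with whole grids (floats)
```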

But actually, there is a better way to do it in NumPy. There’s no need to spend memory on the whole I and J matrices (even though meshgrid is smart enough to only store references to the original vectors if possible). It is sufficient to store only vectors of the correct shape, and the broadcasting rules take care of the rest:
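The same made-up example matrix, A[i, j] = i - j, built from two thin vectors. The sparse=True and ogrid variants are shown as ready-made ways to get such vectors:

```python
import numpy as np

i = np.arange(3)[:, None]      # column vector, shape (3, 1)
j = np.arange(4)[None, :]      # row vector, shape (1, 4)
A = i - j                      # broadcast to (3, 4), no full grids stored

Is, Js = np.meshgrid(np.arange(3), np.arange(4), indexing='ij', sparse=True)
Io, Jo = np.ogrid[0:3, 0:4]    # slice-based equivalent of a sparse meshgrid
```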

Without the indexing='ij' argument, meshgrid will change the order of the arguments: J, I = np.meshgrid(j, i). This is the 'xy' mode, useful for visualizing 3D plots (see the example from the docs).
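The relation between the two modes can be checked directly (the sizes are arbitrary):

```python
import numpy as np

x = np.arange(4)     # horizontal coordinate, 4 columns
y = np.arange(3)     # vertical coordinate, 3 rows

X, Y = np.meshgrid(x, y)                   # default 'xy' mode: shape (3, 4)
I, J = np.meshgrid(y, x, indexing='ij')    # 'ij' mode with swapped arguments
```

Both calls produce (3, 4) grids; X equals J and Y equals I.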

Aside from initializing functions over a two- or three-dimensional grid, the meshes can be useful for indexing arrays:

Works with sparse meshgrids, too
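A sketch of picking a sub-block by row and column numbers; the matrix and the index lists are made up, and np.ix_ is shown as a shortcut that builds the same kind of sparse mesh:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)
rows = np.array([0, 2])
cols = np.array([1, 3, 4])

I, J = np.meshgrid(rows, cols, indexing='ij')
sub = a[I, J]                                   # rows 0 and 2, columns 1, 3 and 4

Is, Js = np.meshgrid(rows, cols, indexing='ij', sparse=True)
sub_sparse = a[Is, Js]                          # same result with (2, 1) and (1, 3) indices
sub_ix = a[np.ix_(rows, cols)]                  # same result again
```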

Matrix statistics

Just like sum, all the other stats functions (min/max, argmin/argmax, mean/median/percentile, std/var) accept the axis parameter and act accordingly:
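A few of them on a made-up matrix:

```python
import numpy as np

a = np.array([[4, 8, 5],
              [9, 3, 1]])

lowest = a.min()                     # over all elements
col_min = a.min(axis=0)              # per column
row_max = a.max(axis=1)              # per row
col_mean = a.mean(axis=0)            # per column
row_median = np.median(a, axis=1)    # per row
row_argmax = a.argmax(axis=1)        # column number of the maximum in each row
```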

np.amin is just an alias of np.min that avoids shadowing the built-in Python min when you write 'from numpy import *'.

The argmin and argmax functions in 2D and above have the annoyance of returning the flattened index (of the first instance of the min or max value). To convert it to two coordinates, the unravel_index function is required:
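For example, on a made-up 2x3 matrix:

```python
import numpy as np

a = np.array([[4, 8, 5],
              [9, 3, 1]])

flat_pos = a.argmin()                                # position in the flattened array
min_pos = np.unravel_index(flat_pos, a.shape)        # (row, column) of the minimum
max_pos = np.unravel_index(a.argmax(), a.shape)      # (row, column) of the maximum
```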

The quantifiers all and any are also aware of the axis argument:
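A short sketch; zero counts as False, everything else as True:

```python
import numpy as np

a = np.array([[1, 0, 3],
              [4, 5, 6]])

any_nonzero = a.any()               # is at least one element nonzero?
all_nonzero = a.all()               # are all elements nonzero?
cols_ok = a.all(axis=0)             # per column
rows_ok = a.all(axis=1)             # per row
rows_big = (a > 4).any(axis=1)      # does each row contain a value above 4?
```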

Matrix sorting

As helpful as the axis argument is for the functions listed above, it is as unhelpful for the 2D sorting:
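A sketch on a made-up matrix: with axis, every column (or every row) is sorted independently of the others, so the original rows fall apart:

```python
import numpy as np

a = np.array([[3, 1, 2],
              [1, 3, 0],
              [2, 2, 1]])

by_axis0 = np.sort(a, axis=0)   # each column sorted on its own
by_axis1 = np.sort(a, axis=1)   # each row sorted on its own
```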

It is just not what you would usually want from sorting a matrix or a spreadsheet: axis is in no way a replacement for the key argument. But luckily, NumPy has several helper functions which allow sorting by a column — or by several columns, if required:

1. a[a[:,0].argsort()] sorts the array by the first column:

Here argsort returns the array of indices that would sort the original array.

This trick can be repeated, but care must be taken so that the next sort does not mess up the results of the previous one:
a = a[a[:,2].argsort()]
a = a[a[:,1].argsort(kind='stable')]
a = a[a[:,0].argsort(kind='stable')]

2. There’s a helper function lexsort which sorts in the way described above by all available columns, but it always operates row-wise and treats the last row as the primary key (i.e., the rows are taken from bottom to top), so its usage is a bit contrived, e.g.
a[np.lexsort(np.flipud(a[:,[2,5]].T))] sorts by column 2 first and then (where the values in column 2 are equal) by column 5.
a[np.lexsort(np.flipud(a.T))] sorts by all columns in left-to-right order.

Here flipud flips the matrix in the up-down direction (to be precise, in the axis=0 direction, same as a[::-1,...], where the three dots mean “all other dimensions”), so it is, somewhat surprisingly, flipud and not fliplr that flips 1D arrays.

3. There also is an order argument to sort, but it is neither fast nor easy to use if you start with an ordinary (unstructured) array.

4. It might be a better option to do it in pandas since this particular operation is way more readable and less error-prone there:
pd.DataFrame(a).sort_values(by=[2,5]).to_numpy() sorts by column 2, then by column 5.
pd.DataFrame(a).sort_values(by=list(range(a.shape[1]))).to_numpy() sorts by all columns in the left-to-right order (sort_values requires the by argument).
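The NumPy-only options (1 and 2) can be checked against each other on a small made-up matrix. Since the sample has only three columns, columns 1 and 2 stand in for the columns 2 and 5 used in the text:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [0, 5, 1],
              [1, 2, 1],
              [0, 5, 0],
              [1, 0, 9]])

# Option 1: sort by the first column only
by_first = a[a[:, 0].argsort()]

# Option 1, repeated: least significant column first, stable sorts afterwards
b = a[a[:, 2].argsort()]
b = b[b[:, 1].argsort(kind='stable')]
b = b[b[:, 0].argsort(kind='stable')]

# Option 2: lexsort treats the last key as the primary one, hence flipud
lex_all = a[np.lexsort(np.flipud(a.T))]               # all columns, left to right
lex_two = a[np.lexsort(np.flipud(a[:, [1, 2]].T))]    # column 1, then column 2
```

Both b and lex_all end up with the rows in left-to-right lexicographic order.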



from Hacker News https://ift.tt/3nJC1on
