NumPy Basics: Arrays and Vectorized Computation
NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy’s array objects as the lingua franca for data exchange.
Here are some of the things you’ll find in NumPy:
- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
- Mathematical functions for fast operations on entire arrays of data without having to write loops.
- Tools for reading/writing array data to disk and working with memory-mapped files.
- Linear algebra, random number generation, and Fourier transform capabilities.
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.
For most data analysis applications, the main areas of functionality I’ll focus on are:
- Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, function application)
One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
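To give you an idea of the performance difference, consider a NumPy array of integers and the equivalent Python list. The session below uses only five elements so the output stays readable; the speed gap only becomes visible at much larger sizes, such as one million elements: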
In : import numpy as np
In : my_arr = np.arange(5)
In : my_arr
output: array([0, 1, 2, 3, 4])
In : my_list = list(range(5))
In : my_list
output: [0, 1, 2, 3, 4]
Now let’s multiply each sequence by 2:
In : %time for _ in range(10): my_arr2 = my_arr * 2
output: Wall time: 0 ns
In : my_arr2
output: array([0, 2, 4, 6, 8])
In : %time for _ in range(10): my_list2 = [x * 2 for x in my_list]
output: Wall time: 0 ns
In : my_list2
output: [0, 2, 4, 6, 8]
NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
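A five-element example is too small to show a measurable gap. The following is a minimal sketch, assuming one million elements and the standard-library timeit module (neither appears in the session above), that makes the comparison observable. Timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 1_000_000  # assumed size, chosen for illustration
big_arr = np.arange(n)
big_list = list(range(n))

# Time 10 repetitions of each approach, mirroring the %time loops above.
arr_seconds = timeit.timeit(lambda: big_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in big_list], number=10)

print(f"NumPy array: {arr_seconds:.4f} s")
print(f"Python list: {list_seconds:.4f} s")

# Both approaches compute the same values.
same_values = (big_arr * 2).tolist() == [x * 2 for x in big_list]
print(same_values)
```

Using timeit rather than %time keeps the sketch runnable outside IPython.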
The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.
To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:
In : import numpy as np
# Generate some random data
In : data = np.random.randn(2, 3)
In : data
output: array([[ 0.79418152, -0.35326436, -1.21186006],
        [-1.71029659,  0.54539353,  1.23544625]])
In : data * 10
output: array([[  7.94181524,  -3.53264361, -12.1186006 ],
        [-17.10296593,   5.45393533,  12.35446249]])
In : data + data
output: array([[ 1.58836305, -0.70652872, -2.42372012],
        [-3.42059319,  1.09078707,  2.4708925 ]])
In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other.
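To see why this matters, it may help to contrast the same operators on a built-in list. The sketch below uses a small hand-picked list (an assumption made for reproducibility, instead of the random data above): on a list, * repeats and + concatenates, while on an ndarray both act element by element.

```python
import numpy as np

values = [1.5, -0.5, 2.0]   # hypothetical sample values
arr = np.array(values)

repeated = values * 2        # list repetition: six elements
scaled = arr * 2             # element-wise multiplication: three elements

joined = values + values     # list concatenation
summed = arr + arr           # element-wise addition

print(repeated)
print(scaled)
print(joined)
print(summed)
```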
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:
In : data.shape
output: (2, 3)
In : data.dtype
output: dtype('float64')
Note: Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few exceptions they all refer to the same thing: the ndarray object.
The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:
In : data1 = [6, 7.5, 8, 0, 1]
In : arr1 = np.array(data1)
In : arr1
output: array([6. , 7.5, 8. , 0. , 1. ])
Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:
In : data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In : arr2 = np.array(data2)
In : arr2
output: array([[1, 2, 3, 4],
        [5, 6, 7, 8]])
Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:
In : arr2.ndim
Output: 2
In : arr2.shape
Output: (2, 4)
Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:
In : arr1.dtype
Output: dtype('float64')
In : arr2.dtype
Output: dtype('int64')
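The inference rule can be probed with a short sketch. The variable names are illustrative, not from the text. The default integer width can differ by platform and NumPy version (int64 on most systems, int32 on some), so only the integer kind is checked for the first array.

```python
import numpy as np

ints = np.array([1, 2, 3])                       # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])                    # one float promotes every element to float64
forced = np.array([1, 2, 3], dtype=np.float64)   # an explicit dtype overrides inference

print(ints.dtype.kind)   # 'i' marks a signed integer type
print(mixed.dtype)
print(forced.dtype)
```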
In addition to np.array, there are a number of other functions for creating new arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:
In : np.zeros(10)
Output: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In : np.zeros((3, 6))
Output: array([[ 0., 0., 0., 0., 0., 0.],
        [ 0., 0., 0., 0., 0., 0.],
        [ 0., 0., 0., 0., 0., 0.]])
In : np.empty((2, 3, 2))
Output: array([[[1.16983965e-311, 3.16202013e-322],
         [0.00000000e+000, 0.00000000e+000],
         [1.78022341e-306, 3.92588777e-033]],
        [[2.36215612e+180, 1.27913099e+165],
         [1.01366871e-070, 1.28080305e+161],
         [2.59105025e+184, 4.39258195e+175]]])
It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.
arange is an array-valued version of the built-in Python range function:
In : np.arange(15)
output: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
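As a brief sketch of the parallel with range, arange also accepts start, stop, and step arguments. The specific numbers below are arbitrary choices for illustration.

```python
import numpy as np

stepped = np.arange(2, 10, 3)            # start=2, stop=10 (exclusive), step=3
builtin = list(range(2, 10, 3))          # the same sequence from the built-in range

print(stepped)
print(builtin)
print(stepped.tolist() == builtin)
```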
Array creation functions
|Function||Description|
|array||Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default|
|asarray||Convert input to ndarray, but do not copy if the input is already an ndarray|
|arange||Like the built-in range but returns an ndarray instead of a list|
|ones, ones_like||Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the same shape and dtype|
|zeros, zeros_like||Like ones and ones_like but producing arrays of 0s instead|
|empty, empty_like||Create new arrays by allocating new memory, but do not populate with any values like ones and zeros|
|full, full_like||Produce an array of the given shape and dtype with all values set to the indicated “fill value”; full_like takes another array and produces a filled array of the same shape and dtype|
|eye, identity||Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere)|
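A few of the functions in the table can be exercised together. This is a small sketch with made-up inputs; it also illustrates the copy behavior that distinguishes array from asarray.

```python
import numpy as np

template = np.array([[1.5, 2.5], [3.5, 4.5]])   # hypothetical input array

ones_like_t = np.ones_like(template)   # same shape and dtype as template
sevens = np.full((2, 3), 7)            # every element set to the fill value 7
identity3 = np.eye(3)                  # 3 x 3 identity matrix

# np.array copies its input by default; np.asarray hands back the same
# object when the input is already an ndarray and no dtype change is requested.
copied = np.array(template)
aliased = np.asarray(template)

print(copied is template)    # False
print(aliased is template)   # True
```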
Data Types for ndarrays
The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:
In : arr1 = np.array([1, 2, 3], dtype=np.float64)
In : arr2 = np.array([1, 2, 3], dtype=np.int32)
In : arr1.dtype
Output: dtype('float64')
In : arr2.dtype
Output: dtype('int32')
Don’t worry about memorizing the NumPy dtypes, especially if you’re a new user. It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, boolean, string, or general Python object. When you need more control over how data are stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.
|Type||Type code||Description|
|int8,uint8||i1,u1||Signed and unsigned 8-bit (1 byte) integer types|
|int16,uint16||i2,u2||Signed and unsigned 16-bit integer types|
|int32,uint32||i4,u4||Signed and unsigned 32-bit integer types|
|int64,uint64||i8,u8||Signed and unsigned 64-bit integer types|
|float16||f2||Half-precision floating point|
|float32||f4 or f||Standard single-precision floating point; compatible with C float|
|float64||f8 or d||Standard double-precision floating point; compatible with C double and Python float object|
|float128||f16 or g||Extended-precision floating point|
|complex64, complex128, complex256||c8, c16, c32||Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively|
|bool||?||Boolean type storing True and False values|
|object||O||Python object type; a value can be any Python object|
|string_||S||Fixed-length ASCII string type (1 byte per character); for example, to create a string dtype with length 10, use ‘S10’|
|unicode_||U||Fixed-length Unicode type (number of bytes platform specific); same specification semantics as string_ (e.g., ‘U10’)|
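The type codes in the table can be passed as strings wherever a dtype is expected, and the chosen dtype determines how many bytes each element occupies. A minimal sketch, with array sizes chosen arbitrarily:

```python
import numpy as np

small = np.zeros(1000, dtype="i1")   # int8: 1 byte per element
wide = np.zeros(1000, dtype="i8")    # int64: 8 bytes per element

print(small.itemsize, small.nbytes)  # 1 1000
print(wide.itemsize, wide.nbytes)    # 8 8000

# A type code and the corresponding named type describe the same dtype.
print(np.dtype("i4") == np.int32)    # True
print(np.dtype("f8") == np.float64)  # True
```

This is the kind of storage control that becomes relevant for large datasets: the same 1000 values take eight times the memory as int64 compared with int8.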
You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:
In : arr = np.array([1, 2, 3, 4, 5])
In : arr.dtype
Output: dtype('int64')
In : float_arr = arr.astype(np.float64)
In : float_arr.dtype
Output: dtype('float64')
In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:
In : arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In : arr
Output: array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In : arr.astype(np.int32)
Output: array([ 3, -1, -2, 0, 12, 10], dtype=int32)
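Truncation here means dropping the fractional part, which moves negative values toward zero and is therefore different from flooring or rounding. The sketch below reuses the same values and compares the three; np.floor and np.rint are standard NumPy functions not otherwise used in this section.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)            # drops the fractional part (toward zero)
floored = np.floor(arr).astype(np.int32)    # rounds toward negative infinity
rounded = np.rint(arr).astype(np.int32)     # rounds to nearest, ties go to the even integer

print(truncated)   # [ 3 -1 -2  0 12 10]
print(floored)     # [ 3 -2 -3  0 12 10]
print(rounded)     # [ 4 -1 -3  0 13 10]
```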
If you have an array of strings representing numbers, you can use astype to convert them to numeric form:
In : numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
In : numeric_strings.astype(float)
Output: array([ 1.25, -9.6 , 42. ])
Note: It’s important to be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data. In NumPy 2.0 and later the name numpy.string_ is no longer available; numpy.bytes_ is the equivalent type.
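The silent truncation can be reproduced with a short sketch. It uses the 'S3' type code (a fixed-length byte string of 3 bytes) instead of numpy.string_ so that it does not depend on that name being available; the sample strings are made up.

```python
import numpy as np

# Creating an array with a too-narrow fixed-width dtype clips the input.
clipped = np.array([b"abcdef"], dtype="S3")
print(clipped)    # [b'abc']

# Assigning a longer value into an existing fixed-width array clips it too.
short = np.array([b"abc", b"de"], dtype="S3")
short[1] = b"longer"
print(short)      # [b'abc' b'lon']
```

No warning or error is raised in either case, which is why the caution above applies.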
Note: Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
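A quick way to check the copy behavior is to modify the result and confirm the original is unchanged. This sketch relies on the default arguments of astype; np.shares_memory is a standard NumPy helper not used elsewhere in this section.

```python
import numpy as np

original = np.array([1.0, 2.0, 3.0])
converted = original.astype(np.float64)   # same dtype as original, still a new array

converted[0] = 99.0                       # modify the copy only

print(original[0])                              # 1.0
print(np.shares_memory(original, converted))    # False
```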
Arithmetic with NumPy Arrays
Any arithmetic operation between equal-size arrays is applied element-wise:
In : arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In : arr
Output: array([[ 1., 2., 3.],
        [ 4., 5., 6.]])
In : arr * arr
Output: array([[ 1., 4., 9.],
        [ 16., 25., 36.]])
In : arr - arr
Output: array([[ 0., 0., 0.],
        [ 0., 0., 0.]])
Arithmetic operations with scalars propagate the scalar argument to each element in the array:
In : 1 / arr
Output: array([[ 1. , 0.5 , 0.3333],
        [ 0.25 , 0.2 , 0.1667]])
In : arr ** 0.5
Output: array([[ 1. , 1.4142, 1.7321],
        [ 2. , 2.2361, 2.4495]])
Comparisons between arrays of the same size yield boolean arrays:
In : arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
In : arr2
Output: array([[ 0., 4., 1.],
        [ 7., 2., 12.]])
In : arr2 > arr
Output: array([[False, True, False],
        [ True, False, True]], dtype=bool)
Operations between differently sized arrays are called broadcasting.
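As a minimal sketch of broadcasting, the (2, 3) array from above can be combined with a one-dimensional row of length 3 or with a (2, 1) column. The row and col values are made up for illustration.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)
col = np.array([[100.], [200.]])               # shape (2, 1)

plus_row = arr + row   # row is added to every row of arr
plus_col = arr + col   # col is added to every column of arr

print(plus_row)
print(plus_col)
```

In both cases the smaller array is stretched along the missing or length-1 dimension, so the result keeps the shape (2, 3).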
Q1- Significance of NumPy
- NumPy is a general-purpose array processing package.
- NumPy arrays are faster and more compact than Python lists.
- NumPy uses much less memory to store data and provides a mechanism for specifying data types.
- NumPy provides standard trigonometric functions.
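As a small illustration of the last point, the trigonometric functions operate on whole arrays at once. The angles below are arbitrary sample values in radians.

```python
import numpy as np

angles = np.array([0.0, np.pi / 6, np.pi / 2])   # radians

sines = np.sin(angles)
cosines = np.cos(angles)

print(np.round(sines, 3))
print(np.round(cosines, 3))

# sin^2 + cos^2 equals 1 for every angle, up to floating-point error.
print(np.allclose(sines ** 2 + cosines ** 2, 1.0))
```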