Python for Data Analysis, Part 1

NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy’s array objects as the lingua franca for data exchange.
Here are some of the things you’ll find in NumPy:

  • ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
  • Mathematical functions for fast operations on entire arrays of data without having to write loops.
  • Tools for reading/writing array data to disk and working with memory-mapped files.
  • Linear algebra, random number generation, and Fourier transform capabilities.
  • A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

For most data analysis applications, the main areas of functionality I’ll focus on are:

  • Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
  • Common array algorithms like sorting, unique, and set operations
  • Efficient descriptive statistics and aggregating/summarizing data
  • Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
  • Expressing conditional logic as array expressions instead of loops with if-elif-else branches
  • Group-wise data manipulations (aggregation, transformation, function application)
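
As a rough illustration of a few of the array-level items in the list above (vectorized arithmetic, sorting, unique and set operations, descriptive statistics, and conditional logic as an array expression), here is a minimal sketch. The sample values and variable names are made up for the example.

```python
import numpy as np

# Made-up sample data, for illustration only
values = np.array([3, 1, 2, 3, 1])
other = np.array([2, 3, 4])

doubled = values * 2                          # vectorized arithmetic, no Python loop
sorted_values = np.sort(values)               # sorted copy of the data
distinct = np.unique(values)                  # sorted unique values
common = np.intersect1d(values, other)        # set intersection of two arrays
mean_value = values.mean()                    # a simple descriptive statistic
labels = np.where(values > 2, "high", "low")  # conditional logic without if/else

print(doubled, sorted_values, distinct, common, mean_value, labels)
```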

One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:

  • NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
  • NumPy operations perform complex computations on entire arrays without the need for Python for loops.

To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:

In [1]: import numpy as np
In [2]: my_arr = np.arange(1000000)
In [3]: my_list = list(range(1000000))

Now let’s multiply each sequence by 2, repeating each operation 10 times under the %time magic (the timings it reports depend on your machine):

In [4]: %time for _ in range(10): my_arr2 = my_arr * 2
In [5]: %time for _ in range(10): my_list2 = [x * 2 for x in my_list]

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
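
The exact speedup depends on the machine, the NumPy version, and the operation, so it is worth measuring yourself. The sketch below is one way to do that outside IPython; it uses the standard-library timeit module instead of the %time magic, and the list's memory figure is only a rough estimate (it sums sys.getsizeof over the elements).

```python
import sys
import timeit

import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each doubling operation
arr_time = timeit.timeit(lambda: my_arr * 2, number=10)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)
print(f"NumPy array: {arr_time:.4f} s, Python list: {list_time:.4f} s")

# Compare memory: the array stores raw integers in one contiguous buffer,
# while the list stores pointers to separate Python int objects
arr_bytes = my_arr.nbytes
list_bytes = sys.getsizeof(my_list) + sum(sys.getsizeof(x) for x in my_list)
print(f"NumPy array: {arr_bytes} bytes, Python list: about {list_bytes} bytes")
```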

The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.
To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:

In [6]: import numpy as np
# Generate some random data
In [7]: data = np.random.randn(2, 3)
In [8]: data
output: array([[ 0.79418152, -0.35326436, -1.21186006],
        [-1.71029659,  0.54539353,  1.23544625]])
In [9]: data * 10
output: array([[  7.94181524,  -3.53264361, -12.1186006 ],
        [-17.10296593,   5.45393533,  12.35446249]])
In [10]: data + data 
output: array([[ 1.58836305, -0.70652872, -2.42372012],
        [-3.42059319,  1.09078707,  2.4708925 ]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [11]: data.shape
output: (2, 3)
In [12]: data.dtype
output: dtype('float64')

Note: Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few exceptions they all refer to the same thing: the ndarray object.

Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In [13]: data1 = [6, 7.5, 8, 0, 1]
In [14]: arr1 = np.array(data1)
In [15]: arr1
output: array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [16]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [17]: arr2 = np.array(data2)
In [18]: arr2
output: array([[1, 2, 3, 4],
        [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [19]: arr2.ndim
Output: 2
In [20]: arr2.shape
Output: (2, 4)
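
The same inference applies to deeper nesting. As a small sketch (the names nested and arr3 are made up for this example), a list of lists of lists becomes a three-dimensional array:

```python
import numpy as np

# Two blocks, each containing two rows of two values
nested = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
arr3 = np.array(nested)

print(arr3.ndim)   # 3
print(arr3.shape)  # (2, 2, 2)
print(arr3.size)   # 8 elements in total
```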

Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:

In [21]: arr1.dtype
Output: dtype('float64')
In [22]: arr2.dtype
Output: dtype('int64')
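
To make the inference a little more concrete, the sketch below builds arrays from a few different kinds of input and prints the dtype chosen for each. The variable names are made up for the example, and the default integer type is platform dependent, so the first result may be int64 or int32.

```python
import numpy as np

ints = np.array([1, 2, 3])                       # all integers
mixed = np.array([1, 2.5, 3])                    # one float upcasts everything to float
flags = np.array([True, False])                  # booleans
words = np.array(["a", "bc"])                    # fixed-length Unicode strings
forced = np.array([1, 2, 3], dtype=np.float32)   # explicit dtype overrides inference

for name, arr in [("ints", ints), ("mixed", mixed), ("flags", flags),
                  ("words", words), ("forced", forced)]:
    print(name, arr.dtype)
```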

In addition to np.array, there are a number of other functions for creating new arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:

In [23]: np.zeros(10)
Output: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In [24]: np.zeros((3, 6))
Output: array([[ 0., 0., 0., 0., 0., 0.],
        [ 0., 0., 0., 0., 0., 0.],
        [ 0., 0., 0., 0., 0., 0.]])
In [25]: np.empty((2, 3, 2))
Output: array([[[1.16983965e-311, 3.16202013e-322],
        [0.00000000e+000, 0.00000000e+000],
        [1.78022341e-306, 3.92588777e-033]],

       [[2.36215612e+180, 1.27913099e+165],
        [1.01366871e-070, 1.28080305e+161],
        [2.59105025e+184, 4.39258195e+175]]])

It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.

arange is an array-valued version of the built-in Python range function:

In [26]: np.arange(15)
output: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Array creation functions

Function            Description
array               Convert input data (list, tuple, array, or other sequence type) to an ndarray, either by inferring a dtype or by using an explicitly specified dtype; copies the input data by default
asarray             Convert input to ndarray, but do not copy if the input is already an ndarray
arange              Like the built-in range, but returns an ndarray instead of a list
ones, ones_like     Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the same shape and dtype
zeros, zeros_like   Like ones and ones_like, but producing arrays of 0s instead
empty, empty_like   Create new arrays by allocating new memory, but do not populate them with any values as ones and zeros do
full, full_like     Produce an array of the given shape and dtype with all values set to the indicated “fill value”; full_like takes another array and produces a filled array of the same shape and dtype
eye, identity       Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere)
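
A short sketch of several of these creation functions follows. The array base and the other variable names are made up for the example, and reshape is used only to get a small 2 × 3 array to work with.

```python
import numpy as np

base = np.arange(6).reshape(2, 3)    # 2 x 3 example array holding 0..5

ones = np.ones_like(base)            # same shape and dtype as base, filled with 1s
zeros = np.zeros_like(base)          # same shape and dtype as base, filled with 0s
sevens = np.full((2, 2), 7)          # 2 x 2 array where every value is 7
nines = np.full_like(base, 9)        # same shape and dtype as base, filled with 9s
ident = np.eye(3)                    # 3 x 3 identity matrix

no_copy = np.asarray(base)           # input is already an ndarray, so no copy is made
copied = np.array(base)              # np.array copies its input by default

print(no_copy is base)               # True
print(copied is base)                # False
```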

Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [27]: arr1 = np.array([1, 2, 3], dtype=np.float64)
In [28]: arr2 = np.array([1, 2, 3], dtype=np.int32)
In [29]: arr1.dtype
Output: dtype('float64')
In [30]: arr2.dtype
Output: dtype('int32')

Don’t worry about memorizing the NumPy dtypes, especially if you’re a new user. It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, boolean, string, or general Python object. When you need more control over how data are stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.

NumPy data types

Type                                Type code      Description
int8, uint8                         i1, u1         Signed and unsigned 8-bit (1 byte) integer types
int16, uint16                       i2, u2         Signed and unsigned 16-bit integer types
int32, uint32                       i4, u4         Signed and unsigned 32-bit integer types
int64, uint64                       i8, u8         Signed and unsigned 64-bit integer types
float16                             f2             Half-precision floating point
float32                             f4 or f        Standard single-precision floating point; compatible with C float
float64                             f8 or d        Standard double-precision floating point; compatible with C double and Python float object
float128                            f16 or g       Extended-precision floating point
complex64, complex128, complex256   c8, c16, c32   Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively
bool                                ?              Boolean type storing True and False values
object                              O              Python object type; a value can be any Python object
string_                             S              Fixed-length ASCII string type (1 byte per character); for example, to create a string dtype with length 10, use 'S10' (NumPy 2.0 removed the string_ alias; use bytes_)
unicode_                            U              Fixed-length Unicode type (number of bytes platform specific); same specification semantics as string_ (e.g., 'U10') (NumPy 2.0 removed the unicode_ alias; use str_)
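
To see why the storage type matters for large datasets, the sketch below compares the memory used by arrays of the same length but different dtypes, and checks that a type code string names the same dtype as the full type name. The dtype names are passed as strings here; the loop and the sizes dictionary are just scaffolding for the example.

```python
import numpy as np

n = 1000
sizes = {}
for name in ["int8", "int32", "float32", "float64"]:
    arr = np.zeros(n, dtype=name)
    sizes[name] = arr.nbytes            # itemsize (bytes per element) times n
    print(name, arr.itemsize, arr.nbytes)

# Type codes are shorthand for the same dtypes
print(np.dtype("i4") == np.dtype("int32"))    # True
print(np.dtype("f8") == np.dtype("float64"))  # True
```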

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In [31]: arr = np.array([1, 2, 3, 4, 5])
In [32]: arr.dtype
Output: dtype('int64')
In [33]: float_arr = arr.astype(np.float64)
In [34]: float_arr.dtype
Output: dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:

In [35]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In [36]: arr
Output: array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In [37]: arr.astype(np.int32)
Output: array([ 3, -1, -2, 0, 12, 10], dtype=int32)
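
Note that truncation moves toward zero, which is not the same as rounding or flooring. If rounding is what you want, one option is to round first and cast afterward. The sketch below compares the three on the same data; np.rint and np.floor are used here as illustrations, and np.rint rounds halves to the nearest even value.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)           # drops the decimal part (toward zero)
rounded = np.rint(arr).astype(np.int32)    # rounds to the nearest integer first
floored = np.floor(arr).astype(np.int32)   # rounds down (toward negative infinity)

print(truncated)  # [ 3 -1 -2  0 12 10]
print(rounded)    # [ 4 -1 -3  0 13 10]
print(floored)    # [ 3 -2 -3  0 12 10]
```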

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

In [38]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
In [39]: numeric_strings.astype(float)
Output: array([ 1.25, -9.6 , 42. ])

Note: Be cautious when using the numpy.bytes_ type (called numpy.string_ before NumPy 2.0), as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
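
To illustrate the truncation mentioned in the note above, here is a minimal sketch with made-up values. It shows two ways the fixed size can bite: assigning a longer string into an existing string array, and requesting a fixed length that is shorter than the input.

```python
import numpy as np

# The element width is inferred from the longest input string (2 characters)
names = np.array(["ab", "cd"])
names[0] = "abcdef"          # longer than the array's fixed width
print(names[0])              # ab  (silently truncated, no warning)

# Requesting a 3-byte string dtype truncates longer inputs as well
short = np.array(["apple", "kiwi"], dtype="S3")
print(short)                 # [b'app' b'kiw']
```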

Note: Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
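
A quick way to check the copying behavior is to modify the result and confirm that the original is unchanged, as in this sketch (the variable names are made up). The last lines use astype's optional copy argument: with copy=False, NumPy may return the input array itself when no conversion is needed, so the "always copies" rule describes the default.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a.astype(np.float64)         # same dtype as a, but still a new array
b[0] = 99.0

print(b is a)                    # False
print(np.shares_memory(a, b))    # False
print(a)                         # [1. 2. 3.]  (the original is untouched)

# Opting out of the copy when the dtype already matches
c = a.astype(np.float64, copy=False)
print(np.shares_memory(a, c))    # True
```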

Arithmetic with NumPy Arrays

Arrays let you express batch operations on data without writing any for loops, a practice NumPy users call vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise:

In [40]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In [41]: arr
Output: array([[ 1., 2., 3.],
        [ 4., 5., 6.]])
In [42]: arr * arr
Output: array([[ 1., 4., 9.],
        [ 16., 25., 36.]])
In [43]: arr - arr
Output: array([[ 0., 0., 0.],
        [ 0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [44]: 1 / arr
Output: array([[ 1. , 0.5 , 0.3333],
             [ 0.25 , 0.2 , 0.1667]])
In [45]: arr ** 0.5
Output: array([[ 1. , 1.4142, 1.7321],
              [ 2. , 2.2361, 2.4495]])
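
For comparison, the sketch below writes out the explicit loop that a scalar expression such as 1 / arr replaces, and checks that both give the same result. The looped array and the index names are made up for the example.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Vectorized: the scalar 1 is applied to every element
vectorized = 1 / arr

# Equivalent explicit loops over rows and columns
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = 1 / arr[i, j]

print(np.allclose(vectorized, looped))  # True
```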

Comparisons between arrays of the same size yield boolean arrays:

In [46]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
In [47]: arr2
Output: array([[ 0., 4., 1.],
        [ 7., 2., 12.]])
In [48]: arr2 > arr
Output: array([[False, True, False],
        [ True, False, True]], dtype=bool)
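
Boolean arrays like this one are useful as inputs to further array expressions. As a minimal sketch using the same two arrays, the mask can be counted with sum (True counts as 1) or passed to np.where to pick values from one array or the other; the names mask, count, and larger are made up for the example.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

mask = arr2 > arr                     # element-wise comparison
count = mask.sum()                    # number of positions where arr2 is larger
larger = np.where(mask, arr2, arr)    # take arr2 where mask is True, else arr

print(count)    # 3
print(larger)   # the element-wise larger of the two arrays
```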

Operations between differently sized arrays are called broadcasting.
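
As a small illustration of broadcasting, the sketch below combines a 2 × 3 array with a length-3 row and with a 2 × 1 column, then shows that shapes that cannot be matched raise an error. The arrays row, col, and bad are made up for the example.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)
col = np.array([[1.], [2.]])                   # shape (2, 1)

plus_row = arr + row     # row is applied to each of the 2 rows
times_col = arr * col    # col is applied across each of the 3 columns

print(plus_row)          # [[11. 22. 33.] [14. 25. 36.]]
print(times_col)         # [[ 1.  2.  3.] [ 8. 10. 12.]]

# Shapes (2, 3) and (2,) are not compatible, so this raises ValueError
bad = np.array([1., 2.])
try:
    arr + bad
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)  # True
```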


Q1. Significance of NumPy


  • NumPy is a general-purpose array processing package.
  • NumPy arrays are faster and more compact than Python lists.
  • NumPy uses much less memory to store data, and it provides a mechanism for specifying data types.
  • NumPy provides standard trigonometric functions.
