## Python for Data Analysis, Part 1

### NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy’s array objects as the lingua franca for data exchange.
Here are some of the things you’ll find in NumPy:

• ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
• Mathematical functions for fast operations on entire arrays of data without having to write loops.
• Tools for reading/writing array data to disk and working with memory-mapped files.
• Linear algebra, random number generation, and Fourier transform capabilities.
• A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

For most data analysis applications, the main areas of functionality I’ll focus on are:

• Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
• Common array algorithms like sorting, unique, and set operations
• Efficient descriptive statistics and aggregating/summarizing data
• Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
• Expressing conditional logic as array expressions instead of loops with if-elif-else branches
• Group-wise data manipulations (aggregation, transformation, function application)
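Several of the items above can be sketched in a few lines. This is a minimal illustration rather than an example from the text: the array `values` and the lookup array `wanted` are made-up data, and the functions shown (`np.where`, `np.sort`, `np.unique`, `np.isin`) are only one reasonable choice for each task.

```python
import numpy as np

values = np.array([3, -1, 4, -1, 5, -9, 2, 6])  # hypothetical sample data

# Conditional logic as an array expression: replace negatives with 0.
clipped = np.where(values < 0, 0, values)

# Sorting returns a sorted copy; unique returns the sorted distinct values.
ordered = np.sort(values)
distinct = np.unique(values)

# Set-style membership test: which elements of `values` appear in `wanted`?
wanted = np.array([4, 5, 6])
mask = np.isin(values, wanted)

# Descriptive statistics computed over the whole array without a loop.
total = values.sum()
average = values.mean()

print(clipped, ordered, distinct, values[mask], total, average)
```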

One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:

• NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
• NumPy operations perform complex computations on entire arrays without the need for Python for loops.
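The memory claim can be checked with a rough estimate. The sketch below assumes CPython, where `sys.getsizeof` reports the size of a single object; the list total therefore adds the list's pointer storage to the size of each integer object, which slightly overcounts because small integers are shared. The size `n` and the `int64` dtype are choices made for this illustration.

```python
import sys
import numpy as np

n = 1_000_000
arr = np.arange(n, dtype=np.int64)
lst = list(range(n))

# The array stores raw 8-byte integers in one contiguous buffer.
array_bytes = arr.nbytes
is_contiguous = arr.flags["C_CONTIGUOUS"]

# The list stores pointers, and each element is a separate Python int object.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

print(is_contiguous)
print(f"array: {array_bytes} bytes, list (estimate): {list_bytes} bytes")
```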

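To give you an idea of the performance difference, consider a NumPy array of integers and the equivalent Python list. Five elements are used here to keep the output readable; the speed gap only becomes measurable at much larger sizes, such as one million elements: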

In : import numpy as np
In : my_arr = np.arange(5)
my_arr
output: array([0, 1, 2, 3, 4])
In : my_list = list(range(5))
my_list
output: [0, 1, 2, 3, 4]

Now let’s multiply each sequence by 2:

In : %%time
for _ in range(10):
    my_arr2 = my_arr * 2
my_arr2
output: Wall time: 0 ns
array([0, 2, 4, 6, 8])
In : %%time
for _ in range(10):
    my_list2 = [x * 2 for x in my_list]
my_list2
output: Wall time: 0 ns
[0, 2, 4, 6, 8]

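With only five elements, both runs report a wall time of 0 ns, so they show no difference. On large arrays, NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.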
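To see the gap outside a notebook, the same comparison can be run at one million elements with the standard-library `timeit` module. This is a sketch under that assumption; the actual timings depend on the machine and NumPy version, so no expected numbers are shown.

```python
import timeit
import numpy as np

n = 1_000_000
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 repetitions of each approach, as in the session above.
arr_time = timeit.timeit(lambda: my_arr * 2, number=10)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy array: {arr_time:.4f} s")
print(f"Python list: {list_time:.4f} s")
```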

## The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.
To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:

In : import numpy as np
# Generate some random data
In : data = np.random.randn(2, 3)
In : data
output: array([[ 0.79418152, -0.35326436, -1.21186006],
[-1.71029659,  0.54539353,  1.23544625]])
In : data * 10
output: array([[  7.94181524,  -3.53264361, -12.1186006 ],
[-17.10296593,   5.45393533,  12.35446249]])
In : data + data
output: array([[ 1.58836305, -0.70652872, -2.42372012],
[-3.42059319,  1.09078707,  2.4708925 ]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In : data.shape
output: (2, 3)
In : data.dtype
output: dtype('float64')
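The homogeneity rule has a visible consequence: when the input mixes integers and floats, NumPy picks one dtype that can hold every element. The short sketch below uses made-up arrays (`mixed`, `grid`) to show this alongside the `shape`, `ndim`, and `size` attributes.

```python
import numpy as np

# Integers and a float are stored together as float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
print(mixed)         # [1.  2.5 3. ]

# shape gives the size of each dimension, ndim the number of dimensions,
# and size the total number of elements.
grid = np.zeros((2, 3))
print(grid.shape, grid.ndim, grid.size)   # (2, 3) 2 6
```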

Note: Whenever you see “array,” “NumPy array,” or “ndarray” in the text, with few exceptions they all refer to the same thing: the ndarray object.

## Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:

In : data1 = [6, 7.5, 8, 0, 1]
In : arr1 = np.array(data1)
In : arr1
output: array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In : data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In : arr2 = np.array(data2)
In : arr2
output:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In : arr2.ndim
Output: 2
In : arr2.shape
Output: (2, 4)

Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:

In : arr1.dtype
Output: dtype('float64')
In : arr2.dtype
Output: dtype('int64')

In addition to np.array, there are a number of other functions for creating new arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:

In : np.zeros(10)
Output: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
In : np.zeros((3, 6))
Output:
array([[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0.]])
In : np.empty((2, 3, 2))
Output:
array([[[1.16983965e-311, 3.16202013e-322],
[0.00000000e+000, 0.00000000e+000],
[1.78022341e-306, 3.92588777e-033]],

[[2.36215612e+180, 1.27913099e+165],
[1.01366871e-070, 1.28080305e+161],
[2.59105025e+184, 4.39258195e+175]]])

It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values, as in the output above.

arange is an array-valued version of the built-in Python range function:

In : np.arange(15)
output: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Array creation functions
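A few more creation functions are sketched here. The selection (`np.ones`, `np.full`, `np.eye`, `np.zeros_like`, and `np.arange` with a step) is my own choice of commonly used NumPy functions, not a list taken from the text, and the array `template` is made up for the example.

```python
import numpy as np

ones = np.ones((2, 3))            # 2x3 array of 1.0
filled = np.full((2, 2), 7.5)     # 2x2 array where every element is 7.5
identity = np.eye(3)              # 3x3 identity matrix

# The *_like variants copy the shape and dtype of an existing array.
template = np.array([[1, 2], [3, 4]])
zeros_like = np.zeros_like(template)

# arange also accepts start, stop (exclusive), and step, like range.
stepped = np.arange(2, 11, 3)

print(ones, filled, identity, zeros_like, stepped, sep="\n")
```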

## Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In : arr1 = np.array([1, 2, 3], dtype=np.float64)
In : arr2 = np.array([1, 2, 3], dtype=np.int32)
In : arr1.dtype
Output: dtype('float64')
In : arr2.dtype
Output: dtype('int32')

Don’t worry about memorizing the NumPy dtypes, especially if you’re a new user. It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, boolean, string, or general Python object. When you need more control over how data are stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.
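As a small illustration of that control over storage, the sketch below stores the same 1,000 values with two integer dtypes and compares the bytes used. The array `values` and the choice of `int64` versus `int16` are assumptions made for the example; a narrower dtype is only safe when every value fits in its range.

```python
import numpy as np

values = np.arange(1000)

as_int64 = values.astype(np.int64)
as_int16 = values.astype(np.int16)   # safe here: all values are below 32768

# itemsize is bytes per element; nbytes is the total for the array.
print(as_int64.itemsize, as_int64.nbytes)   # 8 8000
print(as_int16.itemsize, as_int16.nbytes)   # 2 2000
```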

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In : arr = np.array([1, 2, 3, 4, 5])
In : arr.dtype
Output: dtype('int64')
In : float_arr = arr.astype(np.float64)
In : float_arr.dtype
Output: dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:

In : arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In : arr
Output: array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In : arr.astype(np.int32)
Output: array([ 3, -1, -2, 0, 12, 10], dtype=int32)
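Truncation drops the fractional part, which moves values toward zero; it is not the same as rounding or flooring. If one of those is what you want, apply it before casting. The sketch below reuses the values from the example above; `np.rint` and `np.floor` are shown as two possible alternatives.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)           # toward zero
rounded = np.rint(arr).astype(np.int32)    # nearest integer, ties to even
floored = np.floor(arr).astype(np.int32)   # toward negative infinity

print(truncated)   # [ 3 -1 -2  0 12 10]
print(rounded)     # [ 4 -1 -3  0 13 10]
print(floored)     # [ 3 -2 -3  0 12 10]
```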

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

In : numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
In : numeric_strings.astype(float)
Output: array([ 1.25, -9.6 , 42. ])

Note: Be cautious when using the numpy.bytes_ type (spelled numpy.string_ before NumPy 2.0, which removed that alias), as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
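The silent truncation mentioned in the note can be reproduced in a few lines. This sketch uses a made-up array `names` of byte strings; NumPy infers a fixed width from the longest element at creation time, and later assignments are cut to that width.

```python
import numpy as np

names = np.array([b"ab", b"cde"])   # width inferred as 3 bytes
print(names.dtype)                  # |S3

# Assigning a longer value raises no error; the extra bytes are dropped.
names[0] = b"abcdefgh"
print(names[0])                     # b'abc'
```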

Note: By default, calling astype creates a new array (a copy of the data), even if the new dtype is the same as the old dtype.
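The copy behavior is easy to verify with `np.shares_memory`. The sketch below also shows the `copy=False` argument of `astype`, which lets NumPy return the input array when no conversion is needed; the array `a` is made up for the example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Default behavior: a new array with its own memory, even for the same dtype.
b = a.astype(a.dtype)
b[0] = 99.0
print(a[0])                       # 1.0, the original is unchanged
print(np.shares_memory(a, b))     # False

# With copy=False and an unchanged dtype, the input array itself is returned.
c = a.astype(a.dtype, copy=False)
print(c is a)                     # True
```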

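## Arithmetic with NumPy Arrays

Arrays let you express batch operations on data without writing any for loops, which NumPy users call vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise: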

In : arr = np.array([[1., 2., 3.], [4., 5., 6.]])
In : arr
Output:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
In : arr * arr
Output:
array([[ 1., 4., 9.],
[ 16., 25., 36.]])
In : arr - arr
Output:
array([[ 0., 0., 0.],
[ 0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In : 1 / arr
Output: array([[ 1. , 0.5 , 0.3333],
[ 0.25 , 0.2 , 0.1667]])
In : arr ** 0.5
Output: array([[ 1. , 1.4142, 1.7321],
[ 2. , 2.2361, 2.4495]])

Comparisons between arrays of the same size yield boolean arrays:

In : arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
In : arr2
Output:
array([[ 0., 4., 1.],
[ 7., 2., 12.]])
In : arr2 > arr
Output:
array([[False, True, False],
[ True, False, True]], dtype=bool)


Operations between differently sized arrays are called broadcasting. When the shapes are compatible, NumPy stretches the smaller array across the larger one.
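A minimal broadcasting sketch, reusing the 2x3 array from this section. The arrays `row` and `col` are made up for the example: a length-3 array is applied to every row, and a 2x1 array is applied across every column.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)

# Shape (3,) matches the last dimension, so it is added to each row.
row = np.array([10., 20., 30.])
by_row = arr + row

# Shape (2, 1) matches the first dimension, so it is added to each column.
col = np.array([[100.], [200.]])
by_col = arr + col

print(by_row)
print(by_col)
```

Shapes that cannot be aligned, such as (2, 3) with (2,), raise a ValueError instead of broadcasting.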

## Question

Q1: What is the significance of NumPy?