Mastering NumPy for DAM (Data Science, AI and Machine Learning)

Welcome to this edition of the Python for DAM series. In the last chapter ('Introduction to Python'), we learnt the Python basics necessary for building a solid foundation for Data Science, AI and Machine Learning. Hopefully you have read it and tried your hand at as many practice exercises as possible. In this episode we will be learning a lot about Numerical Python, or NumPy.

NumPy is a large open-source library that enables us to work with different types of arrays and matrices, along with a wide range of high-level mathematical functions. NumPy is key in your DAM journey as it is widely used in data science, AI and machine learning (as well as other fields like engineering and physics).

In the following sections you will be introduced to a type of data structure called an array. You will learn how to work with arrays and perform key functions and operations that will set you up for success as you build your skills in data science, AI and machine learning.

So, put on your learning hat and be prepared to practice along as I take you on another beautiful data adventure.

Before We Get Started

Installation

First, install NumPy if you do not already have it. Open your command prompt and run the following command. (If you are using Jupyter notebooks via Anaconda as we discussed in chapter 1, just skip this step as NumPy is already installed by default.)

pip install numpy

Basics of Arrays

Arrays are a type of data structure similar to lists but, unlike lists, they hold elements of a single type (they are homogeneous). For numerical work they are more memory efficient and much faster to operate on than lists. This is very important when we are performing tasks that require heavy computation and fast response times, such as numerical analysis, graphics and gaming, simulations (virtual reality), data processing and analysis, IoT devices, financial trading systems etc.

Some Characteristics of Arrays

Unlike Python lists, which may contain different data types, NumPy arrays are homogeneous, and this provides better performance for numerical operations.
Arrays support vectorized operations, making them more efficient for large-scale data processing.
Arrays can have different dimensions such as 1D, 2D, 3D and nD arrays.
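
To make the first two points concrete, here is a small sketch. The names mixed, prices and with_tax, and the numbers in them, are made up for illustration only.

```python
import numpy as np

# Mixing integers and a float: NumPy stores everything as one common type.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# A vectorized operation works on every element at once, no loop needed.
prices = np.array([10.0, 20.0, 30.0])  # hypothetical example values
with_tax = prices * 1.5
print(with_tax)  # [15. 30. 45.]

# The same calculation with a plain Python list needs an explicit loop.
price_list = [10.0, 20.0, 30.0]
with_tax_list = [p * 1.5 for p in price_list]
```

Both versions give the same numbers; the array version simply hands the looping to NumPy.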

Dimensions

How are arrays represented in Python? The code block below shows what arrays of different dimensions look like when written out as nested lists.

#example of a 1 dimensional array
array_1d = [1, 2, 3]

#example of a 2 dimensional array
array_2d = [[1,2,3], [4,5,6]]

#example of a 3 dimensional array
array_3d = [[[1,2,3], [4,5,6]],
            [[7,8,9], [3,5,7]],
            [[2,4,6], [1,8,9]]]

Observe that the 1D array has only one level of bracket indicating only one axis, it is said to be "flat". So, the first array in the code block above is a 1D array with 3 elements.
The 2D array has two levels of brackets (brackets in a bracket) indicating 2 axes (rows and columns). It has a dimension of 2X3 (2 rows and 3 columns) as the array is divided into 2 brackets with 3 elements in each.
The 3D array has three levels of brackets indicating 3 axes. Its dimension is 3X2X3 as it has 3 brackets in the first level of division and, in the next level, each is further divided into 2 brackets which contain 3 elements each.
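
If you would rather not count brackets by hand, NumPy can report the number of axes and the shape for you. This is a minimal sketch that assumes the 3D nested list described above is passed to np.array(); the names nested and arr_3d are only for illustration.

```python
import numpy as np

nested = [[[1, 2, 3], [4, 5, 6]],
          [[7, 8, 9], [3, 5, 7]],
          [[2, 4, 6], [1, 8, 9]]]

arr_3d = np.array(nested)
print(arr_3d.ndim)   # 3  (three levels of brackets, so three axes)
print(arr_3d.shape)  # (3, 2, 3)
```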

Creating Arrays

To ensure that the data type you are working with is an array and not a list, you have to create it as an array. We will be looking at 3 different ways of creating arrays.

Using np.array()

 import numpy as np  #np is the conventional alias for NumPy
 my_array = np.array([[1,2,3], [4,5,6]])

Creating arrays with zeros, ones or custom values

 #creating a 2x3 array filled with zeros
 zero_array = np.zeros((2,3))
 print(zero_array)

 #creating a 2x2 array filled with ones
 ones_array = np.ones((2,2))
 print(ones_array)

 #creating a 2x3 array filled with a custom value, in this case 7
 custom_array = np.full((2,3),7)
 print(custom_array)

output:

 zero_array: [[0. 0. 0.]
             [0. 0. 0.]]

 ones_array: [[1. 1.]
             [1. 1.]]

 custom_array: [[7 7 7]
               [7 7 7]]

Creating arrays that are sequences of numbers.

#creating an array with values from 0 to 9
seq_arr = np.arange(10) #output [0 1 2 3 4 5 6 7 8 9]

#creating an array containing 5 equally spaced points from 0 to 1 (both ends included)
seq_arr2 = np.linspace(0,1,5) #output [0.   0.25 0.5  0.75 1.  ]
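
Both functions take a few more arguments that are worth knowing. The sketch below uses the start, stop and step arguments of np.arange() and the endpoint argument of np.linspace(); the names evens and points are just illustrative.

```python
import numpy as np

# start at 2, stop before 11, step by 2
evens = np.arange(2, 11, 2)
print(evens)  # [ 2  4  6  8 10]

# five equally spaced points that start at 0 and stop short of 1
points = np.linspace(0, 1, 5, endpoint=False)
print(points)  # roughly 0, 0.2, 0.4, 0.6, 0.8
```

Note that np.arange() never includes the stop value, while np.linspace() includes it unless endpoint=False is passed.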

Checking Array Attributes

You can easily check for attributes of an array such as the dimension, type and size as follows.

#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])

#lets check the shape (dimensions) of the array
my_array.shape  #your output should be (2, 3)

#lets check the type 
my_array.dtype #your output should be int64 (it may be int32 on some platforms)

#lets check the number of elements in the array
my_array.size # your output should be 6

Accessing Elements

Indexes tell us the position of an element in a data structure. In Python, counting starts from zero: the first element is at index 0, the next at index 1, then index 2, and so on. You can use indexes to get any element you need from an array as follows.

my_array = np.array([[1,2,3,], [4,5,6]])

#accessing the element in the first row, third column
my_array[0, 2] #output should be 3

#accessing all the columns in the second row
my_array[1, :] #output should be array([4, 5, 6])

#accessing all the rows and picking the elements in the second column
my_array[:, 1] #output should be array([2, 5])

Slicing Arrays

With slicing, you can create smaller arrays (subarrays) from larger arrays. Here are a few ways to do it.

#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])

#this picks all rows and selects the columns at index 1 and 2 (the stop index, 3, is excluded)
my_array[:, 1:3] #output array([[2, 3],[5, 6]])

#this accesses the last row in the array
my_array[-1]
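
One thing that surprises many beginners is that a slice is a view of the original array, not a separate copy, so changing the slice also changes the original. The sketch below illustrates this with the same my_array; the names view and independent are only for illustration.

```python
import numpy as np

my_array = np.array([[1, 2, 3], [4, 5, 6]])

# A slice shares memory with the array it was taken from.
view = my_array[:, 1:3]
view[0, 0] = 99
print(my_array[0, 1])  # 99, the original changed too

# Calling .copy() gives an independent array.
independent = my_array[:, 1:3].copy()
independent[0, 0] = -1
print(my_array[0, 1])  # still 99
```

Use .copy() whenever you want to modify a subarray without touching the data it came from.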

Boolean Slicing

This gives an array of True and False values as output, one for each element, depending on the condition specified.

#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])

#checks if each element meets the specified condition and returns True or False for each one
my_array > 3 #output ([[False, False, False],[ True,  True,  True]])

Filtering Array Based on Boolean Indexing

This enables us to create an array containing only the elements where our specified condition is met.

#this returns a 1D array of the elements where the stated condition is met
condition = my_array > 3
my_array[condition] #output: array([4, 5, 6])
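
Conditions can also be combined. In NumPy we use & for "and" and | for "or", and each condition must be wrapped in its own parentheses. The sketch below is one way to do it; the names between and even_or_big are just illustrative.

```python
import numpy as np

my_array = np.array([[1, 2, 3], [4, 5, 6]])

# elements greater than 2 AND less than 6
between = my_array[(my_array > 2) & (my_array < 6)]
print(between)  # [3 4 5]

# elements that are even OR greater than 4
even_or_big = my_array[(my_array % 2 == 0) | (my_array > 4)]
print(even_or_big)  # [2 4 5 6]
```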

Array Operations

NumPy supports a wide range of mathematical and statistical operations. In this section you will be introduced to some of them; as your skill set expands, you will naturally learn more.

Basic operations

Element-wise Arithmetic Operations

#Creating the arrays we'll be using
arr_1 = np.array([1,2,3])
arr_2 = np.array([4,5,6])

#addition
add_arr = arr_1 + arr_2 
print(add_arr)

#subtraction
subt_arr = arr_1 - arr_2 
print(subt_arr)

#multiplication
mult_arr = arr_1 * arr_2 
print(mult_arr)

#division
divd_arr = arr_1 / arr_2
print(divd_arr)

The output should look like this:

arr_1 = [1 2 3]
arr_2 = [4 5 6]
addition = [5 7 9]
subtraction = [-3 -3 -3]
multiplication = [ 4 10 18]
division = [0.25 0.4  0.5 ]

Broadcasting

In each of the operations above, the arrays had the same shape. When we want to perform operations on arrays of different shapes, NumPy uses broadcasting to stretch the smaller array so that it matches the shape of the larger one.

#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])

#adding a single number (a scalar) to every element of the 2D array
broadcasted_result = arr_2d + 10 #output: array([[11, 12, 13],[14, 15, 16]])
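
Broadcasting is not limited to single numbers. Below is a small sketch, with made-up values, that adds a 1D row and a 2D column to the same 2D array; the names row and col are only for illustration.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)
col = np.array([[100], [200]])             # shape (2, 1)

# The row is repeated for every row of arr_2d.
plus_row = arr_2d + row   # rows become [11, 22, 33] and [14, 25, 36]

# The column is repeated for every column of arr_2d.
plus_col = arr_2d + col   # rows become [101, 102, 103] and [204, 205, 206]

print(plus_row.shape, plus_col.shape)  # (2, 3) (2, 3)
```

If the shapes cannot be stretched to match (for example adding an array of 2 elements to one with 3 columns), NumPy raises a ValueError instead.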

Mathematical Functions (ufuncs)

Universal functions (ufuncs) are common mathematical functions that are applied element-wise to an entire array without the need for loops. Examples are np.cos, np.sin, np.exp etc.

#lets create the array we'll be using 
arr = np.array([1, 2, 3, 4])

#sin of the array above
sine = np.sin(arr) #output: array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

#cos of the array
cosine = np.cos(arr) #output:array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362])

#square root of the array
square_root = np.sqrt(arr) #output : array([1.        , 1.41421356, 1.73205081, 2.        ])

Aggregation Functions

NumPy supports a variety of aggregation functions such as mean, sum, median etc that we can use to compute statistics. Here are some examples below.

#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])

total_sum = np.sum(arr_2d) #output:21
mean_value = np.mean(arr_2d) #output:3.5
median_value = np.median(arr_2d) #output:3.5

Axis Parameter

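We use the axis parameter to direct how we want the elements in the array to be aggregated. axis=0 aggregates down the rows and gives one result per column (column-wise), while axis=1 aggregates across the columns and gives one result per row (row-wise).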

#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])

#summing along the column axis
np.sum(arr_2d, axis = 0) #output : array([5, 7, 9])

#summing along the row axis
np.sum(arr_2d, axis = 1)# output : array([ 6, 15])
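
The axis parameter works the same way for the other aggregation functions. Here is a brief sketch using mean and max on the same array; the variable names are only illustrative.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

col_means = arr_2d.mean(axis=0)  # one mean per column
print(col_means)  # [2.5 3.5 4.5]

row_means = arr_2d.mean(axis=1)  # one mean per row
print(row_means)  # [2. 5.]

row_max = arr_2d.max(axis=1)     # largest value in each row
print(row_max)  # [3 6]
```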

Random Module

Random numbers help us model uncertainty or variability, which is important for simulations and experiments in data science and other computational fields. We will be looking at how to create and work with random numbers.

Creating Random Numbers

Using np.random.rand()

np.random.rand() produces random numbers drawn uniformly from the interval between 0 and 1 (0 is included, 1 is not).

#creating an array of random numbers with the dimension 3x3
np.random.rand(3,3)

The output should look like the one below. Your numbers will not be the same, but each element should be between 0 and 1.

array([[0.13763363, 0.35381142, 0.00481775],
       [0.66866343, 0.39626891, 0.77774918],
       [0.7471811 , 0.96245029, 0.30222196]])

Using np.random.randn()

This function generates an array (of a specified shape) with random values from a standard normal distribution, that is, a normal distribution with mean 0 and standard deviation 1.

#specifying the shape as 3x3
np.random.randn(3,3)

the output should look as follows:

array([[ 0.54437669, -0.45676117, -1.15054337],
       [ 0.53277092,  0.02025519,  1.23946618],
       [ 0.49110728,  0.55031084, -0.05885217]])

Setting a Seed for Reproducibility

When we request random numbers, they are created by an algorithm that uses a starting point called a seed. If we do not set one, the numbers produced in each run will be different because the algorithm starts from a different seed each time. When we specify a seed, we tell the random number generator where to start from, so every time we use the same seed, the same sequence of random numbers is produced.

#lets try one without a seed
no_seed1 = np.random.rand(2,2)
print(no_seed1)
no_seed2 = np.random.rand(2,2)
print(no_seed2)

the output without seed

no_seed1 = array([[0.83650398, 0.84172007],
       [0.81614047, 0.00322946]])
no_seed2 = array([[0.94614863, 0.5852246 ],
       [0.48589582, 0.04137543]])

now using a seed

np.random.seed(42)
with_seed1 = np.random.rand(2,2)
print(with_seed1)

np.random.seed(42)
with_seed2 = np.random.rand(2,2)
print(with_seed2)

The output when a seed is used is shown below. We see that each trial produces exactly the same array of numbers.

with_seed1 = array([[0.37454012, 0.95071431],
       [0.73199394, 0.59865848]])

with_seed2 = array([[0.37454012, 0.95071431],
       [0.73199394, 0.59865848]])
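
Newer NumPy code often creates its own generator object with np.random.default_rng() instead of setting the global seed. The sketch below shows the same reproducibility idea with that approach; the names rng_a, rng_b, first and second are only for illustration. Keep in mind that this generator uses a different algorithm, so seed 42 here does not give the same numbers as np.random.seed(42) above.

```python
import numpy as np

# Two generators started from the same seed...
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# ...produce exactly the same sequence of numbers.
first = rng_a.random((2, 2))
second = rng_b.random((2, 2))
print(np.array_equal(first, second))  # True

# The generator also offers a standard normal draw, similar to randn.
normal_values = rng_a.standard_normal((3, 3))
print(normal_values.shape)  # (3, 3)
```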

Random Sampling

The np.random.choice() function helps us draw a random sample of elements from a given array.

#lets create the array we'll be using
array = np.array([1,2,3,4,5])

#creating a random sample from the array we just created
np.random.choice(array, size = 3, replace = False)

Setting replace=False ensures that an element is not selected twice, while size indicates how many elements we want in our random sample.
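
For comparison, here is a small sketch of sampling with and without replacement. It assumes a generator created with np.random.default_rng() (the seed 7 is arbitrary) and the names no_repeats and with_repeats are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, chosen only for this sketch
array = np.array([1, 2, 3, 4, 5])

# Without replacement every element can be picked at most once.
no_repeats = rng.choice(array, size=5, replace=False)
print(np.sort(no_repeats))  # [1 2 3 4 5]

# With replacement the sample can be larger than the array itself.
with_repeats = rng.choice(array, size=8, replace=True)
print(with_repeats.size)  # 8

# Asking for more elements than exist without replacement is an error.
try:
    rng.choice(array, size=8, replace=False)
    too_many_failed = False
except ValueError:
    too_many_failed = True
print(too_many_failed)  # True
```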

If you've gotten to this point, you've gone more than halfway through this chapter! Congratulations, you are a few sections away from becoming a NumPy master. Well done, and keep going.

Linear Algebra with NumPy

In data science and machine learning, linear algebra is used for understanding and working with data sets that involve multiple variables. The linear algebra operations we will be talking about in this section are matrix operations, eigenvalues and eigenvectors.

Matrix Operations

Matrix multiplication

To multiply matrices we can use np.dot() or the @ operator. See the code block below for examples.

#creating the matrices we'll be using
matrix1 = np.array([[1,2], [3,4]])
matrix2 = np.array([[5,6], [7,8]])

#using np.dot()
result_a = np.dot(matrix1, matrix2)
print(result_a)

#using the @ operator
result_b = matrix1 @ matrix2
print(result_b)

The output:

result_a = array([[19, 22],
       [43, 50]])

result_b = array([[19, 22],
       [43, 50]])
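
A common slip is to use * when matrix multiplication is intended. The sketch below, using the same two matrices, shows that * multiplies element by element while @ performs true matrix multiplication; the names elementwise and matmul are only for illustration.

```python
import numpy as np

matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# * multiplies matching positions: 1*5, 2*6, 3*7, 4*8
elementwise = matrix1 * matrix2
print(elementwise.tolist())  # [[5, 12], [21, 32]]

# @ multiplies rows by columns: e.g. 1*5 + 2*7 = 19
matmul = matrix1 @ matrix2
print(matmul.tolist())  # [[19, 22], [43, 50]]
```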

Determinant, Inverse & Rank of a Matrix

#creating the matrix we'll be using
matrix3 = np.array([[3,5], [6,7]])

#finding determinant of a matrix
mat_det = np.linalg.det(matrix3)
print(mat_det)

#finding the inverse of a matrix
mat_inv = np.linalg.inv(matrix3)
print(mat_inv)

#finding the rank of a matrix
mat_rank = np.linalg.matrix_rank(matrix3)
print(mat_rank)

The output (the exact determinant is -9; the tiny difference comes from floating point rounding):

mat_det = -8.999999999999998

mat_inv = array([[-0.77777778,  0.55555556],
       [ 0.66666667, -0.33333333]])

mat_rank = 2
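
Because of that rounding, it is safer to compare floating point results with np.isclose() or np.allclose() than with ==. The sketch below is one way to check the determinant and to confirm that a matrix multiplied by its inverse gives the identity matrix; the name identity_check is only illustrative.

```python
import numpy as np

matrix3 = np.array([[3, 5], [6, 7]])

mat_det = np.linalg.det(matrix3)
print(np.isclose(mat_det, -9))  # True

# A matrix times its inverse should be (very nearly) the identity matrix.
mat_inv = np.linalg.inv(matrix3)
identity_check = matrix3 @ mat_inv
print(np.allclose(identity_check, np.eye(2)))  # True
```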

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors have a lot of applications in DAM, including face recognition, page rank algorithms, quantum computing and many others which you will learn about as you grow in your area of specialization. Here is an example of how to find them.

#creating the array we'll be using
matrix4 = np.array([[1,2], [2,3]])

#finding the eigenvalues/eigenvectors
#(the names used in the print statements must match the names assigned here)
eigenvalues, eigenvectors = np.linalg.eig(matrix4)
print('eigenvalues', eigenvalues)
print('eigenvectors', eigenvectors)

output

eigenvalues [-0.23606798  4.23606798]
eigenvectors [[-0.85065081 -0.52573111]
 [ 0.52573111 -0.85065081]]

I added the strings 'eigenvalues' and 'eigenvectors' to the print statements so that we can tell the values apart. If you are using a Jupyter notebook, be sure to run the np.linalg.eig() line and the print statements in the same block of code, and check that the names in the print statements match the names you assigned; otherwise you may run into a NameError, as I did when putting this tutorial together.
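
If you want to convince yourself that the results are right, you can check the defining property: multiplying the matrix by an eigenvector gives the same result as multiplying that eigenvector by its eigenvalue. This is a minimal sketch of that check; np.linalg.eig() returns the eigenvectors as the columns of the second result, and the names v, lam and checks are only illustrative.

```python
import numpy as np

matrix4 = np.array([[1, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eig(matrix4)

checks = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]   # the i-th eigenvector is the i-th column
    lam = eigenvalues[i]     # its matching eigenvalue
    checks.append(np.allclose(matrix4 @ v, lam * v))

print(checks)  # [True, True]
```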

Advanced Topics

There are several advanced operations in NumPy that you will be exposed to as you grow in your area of specialization. In this section I touch on some of the general advanced topics that you are likely to encounter when working with data.

Reshape and Flatten Arrays

There are times you will need to change the shape of the data you are working with. Sometimes a function or algorithm requires the data to be in 1D, 2D, 3D etc. In such cases reshaping and flattening come in handy. See the subsections below for examples.

Reshaping

Let's create a 1D array and reshape it into a 2D array with 3 rows and 3 columns.

#creating a 1D array
arr = np.arange(1,10)

#reshaping it to a 3x3 array
np.reshape(arr, (3,3))

the output:

#arr 
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

#reshaped array
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
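
Reshaping works for any target shape as long as the total number of elements stays the same. The sketch below, with an illustrative 12-element array, reshapes to a true 3D array and also uses -1, which asks NumPy to work out that dimension for us.

```python
import numpy as np

arr = np.arange(1, 13)  # 12 elements: 1 to 12

# 2 blocks, each with 2 rows and 3 columns (2 * 2 * 3 = 12)
cube = arr.reshape(2, 2, 3)
print(cube.shape)  # (2, 2, 3)

# -1 means "figure this one out": 12 elements in 3 rows gives 4 columns
auto = arr.reshape(3, -1)
print(auto.shape)  # (3, 4)
```

If the element counts do not match (for example reshaping 12 elements to 5x5), NumPy raises a ValueError.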

Flattening

Flattening is used for converting a multidimensional array to a 1D array. Let's illustrate with an example as follows.

#First we create the matrix we'll be using
matrix = np.array([[2,3,4], [5,6,7]])

#flattening the matrix
np.ravel(matrix)

#second method 
matrix.flatten() 

#output
array([2, 3, 4, 5, 6, 7])
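
The two methods give the same values but differ in one practical way: flatten() always returns a copy, while np.ravel() returns a view of the original data whenever it can. Here is a small sketch of the difference; the names flat_copy and raveled are only for illustration.

```python
import numpy as np

matrix = np.array([[2, 3, 4], [5, 6, 7]])

# flatten() copies the data, so the original is untouched.
flat_copy = matrix.flatten()
flat_copy[0] = 100
print(matrix[0, 0])  # 2

# ravel() gives a view here, so the change shows up in the original.
raveled = np.ravel(matrix)
raveled[0] = -1
print(matrix[0, 0])  # -1
```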

Sorting and Searching

Sorting

np.sort() returns a sorted copy of an array, while np.argsort() returns the indices that would sort the array.

arr = np.array([3,1,1,2,6,3,9,4,8,7,6,1,2])
#sorting the Array
np.sort(arr)
#output: array([1, 1, 1, 2, 2, 3, 3, 4, 6, 6, 7, 8, 9])

#getting the indices that would sort the array
np.argsort(arr)
#output : array([ 1,  2, 11,  3, 12,  0,  5,  7,  4, 10,  9,  8,  6], dtype=int64)
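
Where np.argsort() really helps is when one array has to be reordered according to another. The sketch below uses made-up scores and names to show the idea; none of these values come from the tutorial's data.

```python
import numpy as np

# hypothetical paired data: names[i] belongs to scores[i]
scores = np.array([72, 95, 60, 88])
names = np.array(['Ada', 'Ben', 'Cy', 'Dee'])

order = np.argsort(scores)  # indices from lowest to highest score
print(order)  # [2 0 3 1]

# Index both arrays with the same order to keep the pairs together.
print(scores[order])  # [60 72 88 95]
names_low_to_high = names[order]
names_high_to_low = names[order[::-1]]  # reversed for highest first
```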

Searching

np.where() returns the indices where a specified condition is met, while np.searchsorted() is used on sorted arrays to find the index where a value can be inserted so that the array stays sorted. See the code block below for clarification.

#creating the array we'll be using
arr = np.array([3,5,7,9,7,4,5,5,7,2])
np.where(arr > 5)
#output:(array([2, 3, 4, 8], dtype=int64),)

#np.searchsorted() expects a sorted array, so we sort first
sorted_arr = np.sort(arr) #array([2, 3, 4, 5, 5, 5, 7, 7, 7, 9])
np.searchsorted(sorted_arr, 5)
#output : 3
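
To see what that insertion index means in practice, here is a small sketch on an already sorted array with made-up values. It also uses the side argument, which decides whether a value goes before or after equal values already in the array.

```python
import numpy as np

sorted_arr = np.array([1, 3, 5, 7, 9])  # must already be sorted

# Where would 6 go? After 5 (index 2) and before 7, so at index 3.
position = np.searchsorted(sorted_arr, 6)
print(position)  # 3

# Inserting at that index keeps the array sorted.
with_six = np.insert(sorted_arr, position, 6)
print(with_six)  # [1 3 5 6 7 9]

# For a value already present, side chooses before or after it.
left = np.searchsorted(sorted_arr, 5, side='left')    # 2
right = np.searchsorted(sorted_arr, 5, side='right')  # 3
```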

Data Manipulation

Data manipulation means transforming, cleaning and structuring data to make it ready and easy to work with. This may take various forms, as we will see in this section.

Concatenation and Stacking

Concatenation

Concatenation means combining arrays. There are several ways to do that with NumPy, including np.concatenate(), np.hstack() and np.vstack(). Let's practice with the example below.

#creating the arrays we'll be using
arr1 = np.array([[1,2,3], [4,5,6]])
arr2 = np.array([[5,6,7], [7,8,9]])

#concatenating them vertically 
vert_concat = np.concatenate((arr1, arr2), axis=0)
print(vert_concat)

#concatenating them horizontally 
hor_concat = np.concatenate((arr1, arr2), axis=1)
print(hor_concat)

output:

vert_concat : array([[1, 2, 3],
                   [4, 5, 6],
                   [5, 6, 7],
                   [7, 8, 9]])
hor_concat : array([[1, 2, 3, 5, 6, 7],
                  [4, 5, 6, 7, 8, 9]])
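
For 2D arrays like these, np.vstack() and np.hstack() are shortcuts for the two concatenations. A minimal sketch, assuming two arrays with 2 rows and 3 columns each (the names stacked_rows and stacked_cols are only illustrative):

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[5, 6, 7], [7, 8, 9]])

# vstack stacks the arrays on top of each other (like axis=0)
stacked_rows = np.vstack((arr1, arr2))
print(stacked_rows.shape)  # (4, 3)

# hstack places them side by side (like axis=1 for 2D arrays)
stacked_cols = np.hstack((arr1, arr2))
print(stacked_cols.shape)  # (2, 6)
```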

Splitting arrays

Splitting arrays is a versatile and fundamental operation when working with data. In the context of this series, it enables proper model evaluation, facilitates efficient processing of large datasets and supports various data exploration and analysis tasks.

Splitting Arrays into smaller arrays

We can use the np.split() or np.array_split() to achieve this. Let's show that with examples.

#First we create the array we'll be using
arr = np.array([1,9,4,7,6,2,4,1,9,4,0,8,2,5,4,4,8,9,3,0,2,7,4,6,3])
#splitting the array into 5 subarrays
np.split(arr, 5)

output:

[array([1, 9, 4, 7, 6]),
 array([2, 4, 1, 9, 4]),
 array([0, 8, 2, 5, 4]),
 array([4, 8, 9, 3, 0]),
 array([2, 7, 4, 6, 3])]
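
np.split() insists on equal-sized pieces, which is where np.array_split() comes in. The sketch below uses an illustrative 10-element array to show both, and also splits at a chosen index, which is one simple way to get an 80/20 split; the names parts, train and test are only for illustration.

```python
import numpy as np

arr = np.arange(10)  # 0 to 9

# 10 elements cannot be divided into 3 equal parts, so np.split() fails...
try:
    np.split(arr, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True

# ...but np.array_split() allows uneven pieces (sizes 4, 3 and 3 here).
parts = np.array_split(arr, 3)
print([len(p) for p in parts])  # [4, 3, 3]

# Passing a list of indices splits at those positions instead.
train, test = np.split(arr, [8])  # first 8 elements, then the rest
print(len(train), len(test))  # 8 2
```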

Unique Values

We can find the unique values in an array using np.unique(), for example:

arr = np.array([1,9,4,7,6,2,4,1,9,4,0,8,2,5,4,4,8,9,3,0,2,7,4,6,3])
np.unique(arr) #output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
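
np.unique() can also tell us how many times each value occurs if we pass return_counts=True. Here is a short sketch on a smaller, made-up array so the counts are easy to verify by eye.

```python
import numpy as np

small_arr = np.array([3, 1, 2, 3, 1, 3])

values, counts = np.unique(small_arr, return_counts=True)
print(values)  # [1 2 3]
print(counts)  # [2 1 3]  (1 appears twice, 2 once, 3 three times)
```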

Saving And Loading Data

Saving

We use np.save() to save an array to a file in .npy, a binary format designed for NumPy arrays. If the file name does not already end in .npy, np.save() adds the extension for us. We can also use np.savetxt() to save the array as a text file that is readable by human beings.

#lets create the array we want to save
arr= np.array([[2,6,4,8], [6,2,5,7]])

#saving it using the first method
np.save('saved_array.npy', arr)

#saving it using the second method
np.savetxt('saved_array.txt', arr)

Loading

np.load() is used to load a single array from a .npy file, while np.loadtxt() is used to load arrays from text files.

loaded_array = np.load('saved_array.npy')
print(loaded_array)
#or
loaded_array_txt = np.loadtxt('saved_array.txt')
print(loaded_array_txt)
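
One detail worth knowing is that the text route does not preserve the data type by default: np.savetxt() writes the numbers out as floats and np.loadtxt() reads them back as floats. The sketch below is one way to do a full round trip while keeping integers. It writes to a temporary folder so nothing is left behind, and the fmt, delimiter and dtype choices are my own assumptions for this example.

```python
import os
import tempfile

import numpy as np

arr = np.array([[2, 6, 4, 8], [6, 2, 5, 7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, 'saved_array.npy')
    txt_path = os.path.join(tmp_dir, 'saved_array.txt')

    # binary format: shape and dtype are stored with the data
    np.save(npy_path, arr)
    from_npy = np.load(npy_path)

    # text format: write integers separated by commas, read them back as int
    np.savetxt(txt_path, arr, fmt='%d', delimiter=',')
    from_txt = np.loadtxt(txt_path, delimiter=',', dtype=int)

print(from_npy.dtype == arr.dtype)    # True
print(np.array_equal(from_txt, arr))  # True
```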

In Conclusion

In this episode we have:

Been introduced to arrays, the core data structure in NumPy, and learnt how to create them and carry out many operations on them.
Looked at the random module, where we learnt how to create and use random numbers.
Carried out some linear algebra operations such as matrix operations, eigenvalues and eigenvectors.
Learnt some advanced NumPy topics like reshaping and flattening arrays, sorting and searching.
Touched on the following data manipulation operations: concatenation, stacking, splitting and unique values.
Last but not least, learnt how to save and load arrays.

Phew! That was a lot! You should be really proud of yourself and take a moment to celebrate, but before that, see the next section.

Next Steps

Practice! Practice!! Practice!!! Get your hands on as many practice exercises as possible, as this is the best way to learn. Remember, you lose what you don't use.
Like we said in the previous episode, don't wait until you feel you've learnt everything. Always strive to implement what you've learnt. You will run into challenges that will require you to do some research; this is very normal, even for those with experience. It's your determination during such periods that builds you into the professional you are aiming to become.

You can use Google, YouTube and ChatGPT to carry out your research. You can also refer to the official NumPy documentation or the official NumPy quickstart tutorial.

One last thing

It's time to take a moment to celebrate your milestone of mastering NumPy. When you are done, check chapter 3 of this series where we'll be discussing all about Pandas. Until then cheers to you!

Hi, I am Ajoke and I am a data specialist proficient in the use of Python, SQL, Excel and Tableau. If you need help with data-related technical writing, training or analytics, you can reach me via email at ajokearegbeyen@gmail.com.

Learning Python for DAM Series - Chapter 2