Mastering NumPy for DAM (Data Science, AI and Machine Learning)
Learning Python for DAM Series - Chapter 2
Welcome to this edition of the Python for DAM series. In the last chapter ('Introduction to Python'), we learnt the Python basics necessary for building a solid foundation for data science, AI and machine learning. Hopefully you have read it and tried your hand at as many practice exercises as possible. In this episode we will be learning a lot about Numerical Python, NumPy.
NumPy is a large open-source library that enables us to work with different types of arrays, matrices and various high-level mathematical functions. NumPy is key in your DAM journey as it is widely used in the fields of data science, AI and machine learning (as well as other fields, like engineering and physics).
In the following sections you will be introduced to a type of data structure called arrays. You will learn how to work with them and perform key functions and operations that will set you up for success as you build your skills in the areas of data science, AI and machine learning.
So, put on your learning hat and be prepared to practice along as I take you on another beautiful data adventure.
Before We Get Started
Installation
First install NumPy if you do not already have it installed. Open your command prompt and run the following command. (If you are using Jupyter notebooks via Anaconda as we discussed in chapter 1, just skip this step as you will already have NumPy installed by default.)
pip install numpy
Basics of Arrays
Arrays are a type of data structure similar to lists, but unlike lists they hold elements of a single type (homogeneous), are more memory efficient and carry out numerical operations on whole collections of values much faster than lists. This is very important when we are performing tasks that require heavy computation and fast response times, such as numerical analysis, graphics and gaming, simulations (virtual reality), data processing and analysis, IoT devices, financial trading systems etc.
Some Characteristics of Arrays
Unlike Python lists that may contain different data types, NumPy arrays are homogeneous, and this provides better performance for numerical operations.
Arrays support vectorized operations, making them more efficient for large-scale data processing.
Arrays can have different dimensions such as 1D, 2D, 3D & nD arrays.
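To make the first two characteristics concrete, here is a small sketch. The sample values and variable names are made up for illustration; the point is only to show how NumPy settles on one data type and applies an operation to every element at once.

```python
import numpy as np

# A Python list can mix types; a NumPy array converts everything to one dtype.
mixed_list = [1, 2.5, 3]
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)  # float64: the integers were upcast to floats

# Vectorized operation: one expression acts on every element, no explicit loop.
prices = np.array([10.0, 20.0, 30.0])  # hypothetical sample data
doubled = prices * 2
print(doubled)  # [20. 40. 60.]

# The equivalent with a plain list needs a loop or a comprehension.
doubled_list = [p * 2 for p in [10.0, 20.0, 30.0]]
print(doubled_list)  # [20.0, 40.0, 60.0]
```

Note that multiplying a plain list by 2 would repeat the list rather than double its elements, which is one reason arrays are the better fit for numerical work.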
Dimensions
How are arrays represented in Python? The code block below shows examples of what arrays of different dimensions look like, written here as nested lists.
#Python names cannot start with a digit, so we use array_1d rather than 1D_array
#example of a 1 dimensional array
array_1d = [1, 2, 3]
#example of a 2 dimensional array
array_2d = [[1,2,3], [4,5,6]]
#example of a 3 dimensional array
array_3d = [[[1,2,3], [4,5,6]],
[[7,8,9], [3,5,7]],
[[2,4,6], [1,8,9]]]
Observe that the 1D array has only one level of bracket indicating only one axis, it is said to be "flat". So, the first array in the code block above is a 1D array with 3 elements.
The 2D array has two levels of brackets (brackets in a bracket) indicating 2 axes (rows and columns). It has a dimension of 2X3 (2 rows and 3 columns) as the array is divided into 2 brackets with 3 elements in each.
The 3D array has three levels of brackets indicating 3 axes. Its dimension is 3X2X3 as it has 3 brackets in the first level of division, each of which is further divided into 2 brackets that contain 3 elements each.
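If counting brackets by eye gets tiring, NumPy can report the number of axes and the shape for us. The sketch below uses the ndim and shape attributes; the variable names are my own choice for illustration.

```python
import numpy as np

array_1d = np.array([1, 2, 3])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
array_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                     [[7, 8, 9], [3, 5, 7]],
                     [[2, 4, 6], [1, 8, 9]]])

# ndim is the number of axes (levels of brackets), shape is the size along each axis
print(array_1d.ndim, array_1d.shape)  # 1 (3,)
print(array_2d.ndim, array_2d.shape)  # 2 (2, 3)
print(array_3d.ndim, array_3d.shape)  # 3 (3, 2, 3)
```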
Creating Arrays
To ensure that the data type you are working with is an array and not a list, you have to create it as an array. We will be looking at 3 different ways of creating arrays.
Using np.array()
#import NumPy first; np is the conventional short name
import numpy as np
my_array = np.array([[1,2,3], [4,5,6]])
Creating arrays with zeros, ones or custom values
#creating a 2x3 array filled with zeros
zero_array = np.zeros((2,3))
print(zero_array)
#creating a 2x2 array filled with ones
ones_array = np.ones((2,2))
print(ones_array)
#creating a 2x3 array filled with a custom value, in this case 7
custom_array = np.full((2,3),7)
print(custom_array)
output:
zero_array:
[[0. 0. 0.]
[0. 0. 0.]]
ones_array:
[[1. 1.]
[1. 1.]]
custom_array:
[[7 7 7]
[7 7 7]]
Creating arrays that are sequences of numbers.
#creating an array with values from 0 - 9
seq_arr = np.arange(10) #output [0 1 2 3 4 5 6 7 8 9]
#creating an array containing 5 points equally spaced between 0 and 1
seq_arr2 = np.linspace(0,1,5) #output [0. 0.25 0.5 0.75 1. ]
Checking Array Attributes
You can easily check for attributes of an array such as the dimension, type and size as follows.
#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])
#lets check the shape (dimensions) of the array
my_array.shape #your output should be (2, 3)
#lets check the data type of the elements
my_array.dtype #your output should be int64 (it may be int32 on some platforms)
#lets check the number of elements in the array
my_array.size # your output should be 6
Accessing Elements
Indexes tell us the position of an element in a data structure. In Python, the first element is at index 0, the next at index 1, followed by index 2, then index 3 and so on. You can use the indexes to get any element you need in a data structure as follows.
my_array = np.array([[1,2,3,], [4,5,6]])
#accessing the element in the first row, third column
my_array[0, 2] #output should be 3
#accessing the elements in the second row, all columns
my_array[1, :] #output should be array([4, 5, 6])
#accessing all the rows and picking the elements in the second column
my_array[:, 1] #output should be array([2, 5])
Slicing Arrays
With slicing, you can create smaller arrays (subarrays) from larger arrays. Here are a few ways to do it.
#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])
#this picks all rows and the columns at index 1 and 2 (the stop index 3 is excluded)
my_array[:, 1:3] #output array([[2, 3],[5, 6]])
#this accesses the last row in the array
my_array[-1] #output array([4, 5, 6])
Boolean Slicing
This gives an array of True and False elements as output, depending on the condition specified.
#lets first create the array we'll be using
my_array = np.array([[1,2,3,], [4,5,6]])
#checks if each element meets the specified condition and returns True or False values
my_array > 3 #output array([[False, False, False],[ True, True, True]])
Filtering Array Based on Boolean Indexing
This enables us to create an array containing only the elements that meet our specified condition.
#this returns a 1D array of the elements where the stated condition is met
condition = my_array > 3
my_array[condition] #output array([4, 5, 6])
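Conditions can also be combined. The sketch below is one way to do it, using & for "and" (| works the same way for "or"); each condition goes in its own parentheses. The variable names are illustrative.

```python
import numpy as np

my_array = np.array([[1, 2, 3], [4, 5, 6]])

# a single condition, written inline
greater_than_3 = my_array[my_array > 3]
print(greater_than_3)  # [4 5 6]

# two conditions combined with & (each one wrapped in parentheses)
even_and_large = my_array[(my_array > 2) & (my_array % 2 == 0)]
print(even_and_large)  # [4 6]
```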
Array Operations
NumPy supports a wide range of functions that enable us to perform various mathematical and statistical operations. In this section you will be introduced to some of these operations; as your skill set expands you will naturally learn more.
Basic operations
Element-wise Arithmetic Operations
#Creating the arrays we'll be using
arr_1 = np.array([1,2,3])
arr_2 = np.array([4,5,6])
#addition
add_arr = arr_1 + arr_2
print(add_arr)
#subtraction
subt_arr = arr_1 - arr_2
print(subt_arr)
#multiplication
mult_arr = arr_1 * arr_2
print(mult_arr)
#division
divd_arr = arr_1 / arr_2
print(divd_arr)
The output should look like this:
arr_1 = [1 2 3]
arr_2 = [4 5 6]
addition = [5 7 9]
subtraction = [-3 -3 -3]
multiplication = [ 4 10 18]
division = [0.25 0.4 0.5 ]
Broadcasting
In each of the operations above, the arrays had the same shape. When we want to perform operations on arrays of different shapes, NumPy uses broadcasting to stretch the smaller array so that it matches the shape of the larger array, provided the shapes are compatible.
#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])
#adding a single number (a scalar) to every element of the 2D array
broadcasted_result = arr_2d + 10 #output: array([[11, 12, 13],[14, 15, 16]])
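Broadcasting is not limited to single numbers. As a rough sketch (the row and column arrays here are made-up examples), a 1D array with 3 elements is applied to every row of a 2x3 array, and a 2x1 column is applied to every column.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)
col = np.array([[100], [200]])              # shape (2, 1)

# the row is added to each of the 2 rows
plus_row = arr_2d + row
print(plus_row)  # [[11 22 33] [14 25 36]]

# the column is added to each of the 3 columns
plus_col = arr_2d + col
print(plus_col)  # [[101 102 103] [204 205 206]]
```

If the shapes cannot be matched up this way, for example adding an array of 2 elements to the 2x3 array, NumPy raises a ValueError instead of guessing.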
Mathematical Functions (ufuncs)
Universal functions (ufuncs) are common mathematical functions that operate elementwise on an entire array without the need for loops. Examples are np.cos, np.sin, np.exp etc.
#lets create the array we'll be using
arr = np.array([1, 2, 3, 4])
#sin of the array above
sine = np.sin(arr) #output: array([ 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
#cos of the array
cosine = np.cos(arr) #output:array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362])
#square root of the array
square_root = np.sqrt(arr) #output : array([1. , 1.41421356, 1.73205081, 2. ])
Aggregation Functions
NumPy supports a variety of aggregation functions such as mean, sum, median etc that we can use to compute statistics. Here are some examples below.
#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])
Totalsum = np.sum(arr_2d) #output:21
mean_value = np.mean(arr_2d) #output:3.5
median_value = np.median(arr_2d) #output:3.5
Axis Parameter
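We use the axis parameter to direct how we want the elements in the array to be aggregated. axis=0 aggregates down the rows and gives one result per column (column-wise), while axis=1 aggregates across the columns and gives one result per row (row-wise).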
#lets create the array we'll be using
arr_2d = np.array([[1,2,3,], [4,5,6]])
#summing each column (axis 0 collapses the rows)
np.sum(arr_2d, axis = 0) #output : array([5, 7, 9])
#summing each row (axis 1 collapses the columns)
np.sum(arr_2d, axis = 1) #output : array([ 6, 15])
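The axis parameter works the same way for the other aggregation functions. Here is a short sketch with np.mean and np.max; looking at the shape of each result is a handy way to check which axis was collapsed. The variable names are illustrative.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

# axis=0: one value per column, so the result has 3 elements
col_means = np.mean(arr_2d, axis=0)
print(col_means, col_means.shape)  # [2.5 3.5 4.5] (3,)

# axis=1: one value per row, so the result has 2 elements
row_means = np.mean(arr_2d, axis=1)
print(row_means, row_means.shape)  # [2. 5.] (2,)

# the largest value in each column
col_max = np.max(arr_2d, axis=0)
print(col_max)  # [4 5 6]
```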
Random Module
Random numbers help to model uncertainty or variability, which is important for simulations and experiments in data science and other fields that require computation. We will be looking at how to create and work with random numbers.
Creating Random Numbers
Using np.random.rand()
np.random.rand() helps us to produce random numbers between 0 and 1.
#creating an array of random numbers with the dimension 3x3
np.random.rand(3,3)
The output should look like the one below. Your results will not be identical, but each element should be between 0 and 1.
array([[0.13763363, 0.35381142, 0.00481775],
[0.66866343, 0.39626891, 0.77774918],
[0.7471811 , 0.96245029, 0.30222196]])
Using np.random.randn()
This function generates an array (of a specified shape) with random values from a standard normal distribution.
#specifying the shape as 3x3
np.random.randn(3,3)
The output should look as follows (your values will differ):
array([[ 0.54437669, -0.45676117, -1.15054337],
[ 0.53277092, 0.02025519, 1.23946618],
[ 0.49110728, 0.55031084, -0.05885217]])
Setting a Seed for Reproducibility
When we request random numbers, they are created by an algorithm that uses a starting point called a seed. If we do not set one, the algorithm uses a different seed on each run, so the numbers produced will differ from run to run. When we specify a seed, it tells the random number generator where to start from, hence every time we use the same seed, the same sequence of random numbers will be produced.
#lets try one without seed
no_seed1 = np.random.rand(2,2)
print(no_seed1)
no_seed2 = np.random.rand(2,2)
print(no_seed2)
the output without seed
no_seed1 = array([[0.83650398, 0.84172007],
[0.81614047, 0.00322946]])
no_seed2 = array([[0.94614863, 0.5852246 ],
[0.48589582, 0.04137543]])
now using a seed
np.random.seed(42)
with_seed1 = np.random.rand(2,2)
print(with_seed1)
np.random.seed(42)
with_seed2 = np.random.rand(2,2)
print(with_seed2)
The output when a seed is used. We see that each trial produces the exact same array of numbers.
with_seed1 = array([[0.37454012, 0.95071431],
[0.73199394, 0.59865848]])
with_seed2 =array([[0.37454012, 0.95071431],
[0.73199394, 0.59865848]])
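As a side note, newer versions of NumPy also offer a generator object created with np.random.default_rng(), which takes the seed directly. The sketch below is an alternative to np.random.seed() rather than what the examples above use; the names rng_a and rng_b are just for illustration.

```python
import numpy as np

# two generators started from the same seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# random() draws floats between 0 and 1, like np.random.rand()
sample_a = rng_a.random((2, 2))
sample_b = rng_b.random((2, 2))

# same seed, same sequence of numbers
print(np.array_equal(sample_a, sample_b))  # True
```

The numbers from a generator will not match those from np.random.seed(42) followed by np.random.rand(), since the two use different algorithms under the hood.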
Random Sampling
The np.random.choice() function helps us to draw a random sample from a given array.
#lets create the array we'll be using
array = np.array([1,2,3,4,5])
#creating a random sample from the array we just created
np.random.choice(array, size = 3, replace = False)
replace = False here ensures that an element is not selected twice, while size indicates how many elements we want in our random sample array.
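To see the effect of replace, here is a small sketch that samples both ways. The seed value 0 is arbitrary and only there so the run is repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
array = np.array([1, 2, 3, 4, 5])

# without replacement: every element can appear at most once,
# so asking for all 5 gives a shuffled copy of the array
without_repl = np.random.choice(array, size=5, replace=False)
print(sorted(without_repl.tolist()))  # [1, 2, 3, 4, 5]

# with replacement: elements may repeat, so the sample
# can even be larger than the original array
with_repl = np.random.choice(array, size=8, replace=True)
print(len(with_repl))  # 8
```

Asking for more elements than the array holds while replace = False raises a ValueError, since there is nothing left to pick from.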
If you've gotten to this point, you've gone more than halfway through this chapter! Congratulations, you are a few sections away from becoming a NumPy master. Well done and keep going.
Linear Algebra with NumPy
In data science and machine learning, linear algebra is used for understanding and working with data sets that involve multiple variables. Some of the linear algebra operations we will be talking about in this section are matrix operations, eigenvalues and eigenvectors.
Matrix Operations
Matrix multiplication
To multiply matrices we can use np.dot() or the @ operator to carry out the operation. See the code block below for examples.
#creating the matrices we'll be using
matrix1 = np.array([[1,2], [3,4]])
matrix2 = np.array([[5,6], [7,8]])
#using np.dot()
result_a = np.dot(matrix1, matrix2)
print(result_a)
#using the @ operator
result_b = matrix1 @ matrix2
print(result_b)
The output:
result_a = array([[19, 22],
[43, 50]])
result_b = array([[19, 22],
[43, 50]])
Determinant, Inverse & Rank of a Matrix
#creating the matrix we'll be using
matrix3 =np.array([[3,5], [6,7]])
#finding determinant of a matrix
mat_det = np.linalg.det(matrix3)
print(mat_det)
#finding the inverse of a matrix
mat_inv = np.linalg.inv(matrix3)
print(mat_inv)
#finding the rank of a matrix
mat_rank = np.linalg.matrix_rank(matrix3)
print(mat_rank)
The output:
mat_det = -8.999999999999998
mat_inv = array([[-0.77777778, 0.55555556],
[ 0.66666667, -0.33333333]])
mat_rank = 2
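Notice that the determinant comes out as -8.999999999999998 rather than exactly -9. That is ordinary floating point rounding, so when checking results like these it is safer to compare with a tolerance. A minimal sketch using np.isclose() and np.allclose():

```python
import numpy as np

matrix3 = np.array([[3, 5], [6, 7]])

# the exact determinant is 3*7 - 5*6 = -9
mat_det = np.linalg.det(matrix3)
print(np.isclose(mat_det, -9.0))  # True

# a matrix multiplied by its inverse should give the identity matrix
mat_inv = np.linalg.inv(matrix3)
product = matrix3 @ mat_inv
print(np.allclose(product, np.eye(2)))  # True
```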
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors have a lot of applications in DAM, some of which include face recognition, page rank algorithms, quantum computing and many others which you will learn about as you grow in your area of specialization. Here is an example of how to find eigenvalues and eigenvectors.
#creating the array we'll be using
matrix4 = np.array([[1,2], [2,3]])
#finding the eigenvalues/eigenvectors
#the names assigned here must match the names used in print() below
eigenvalues, eigenvectors = np.linalg.eig(matrix4)
print('eigenvalues', eigenvalues)
print('eigenvectors', eigenvectors)
output
eigenvalues [-0.23606798 4.23606798]
eigenvectors [[-0.85065081 -0.52573111]
[ 0.52573111 -0.85065081]]
I added the strings 'eigenvalues' and 'eigenvectors' in the print statements so that we can tell the values apart. Make sure the names inside print() match the names you assigned the results to, and if you are using Jupyter notebook, run the eigenvalue/eigenvector code before (or in the same cell as) the print statements. If not, you may run into a NameError, as I did when putting this tutorial together.
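A quick way to convince yourself the results are right is to check the defining property: multiplying the matrix by an eigenvector should give the same result as multiplying that eigenvector by its eigenvalue. Here is a small sketch; note that np.linalg.eig() returns the eigenvectors as the columns of the second array.

```python
import numpy as np

matrix4 = np.array([[1, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eig(matrix4)

checks = []
for i in range(len(eigenvalues)):
    value = eigenvalues[i]
    vector = eigenvectors[:, i]  # the i-th column pairs with the i-th eigenvalue
    # matrix @ vector should equal value * vector (up to rounding)
    checks.append(np.allclose(matrix4 @ vector, value * vector))

print(checks)  # [True, True]
```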
Advanced Topics
There are several advanced operations in NumPy that you will be exposed to as you grow in your area of specialization. In this section I touch on some of the general advanced topics that you are likely to encounter when working with data.
Reshape and Flatten Arrays
There are times you will need to change the shape of the data you are working with. Sometimes the function or algorithm may require the data to be in 1D, 2D or 3D etc. In such cases reshaping and flattening come in handy. See the subsections below for examples.
Reshaping
Let's create a 1D array and reshape it into a 2D (3x3) array.
#creating a 1D array
arr = np.arange(1,10)
#reshaping it to a 3x3 array
np.reshape(arr, (3,3))
the output:
#arr
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
#reshaped array
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
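Two things are worth knowing when reshaping: the new shape must hold exactly the same number of elements, and you can pass -1 for one of the dimensions to let NumPy work it out. A short sketch, with a 12-element array chosen just for illustration:

```python
import numpy as np

arr = np.arange(1, 13)  # 12 elements: 1 to 12

# -1 tells NumPy to work out that dimension: 12 elements / 2 rows = 6 columns
two_rows = arr.reshape(2, -1)
print(two_rows.shape)  # (2, 6)

# reshaping into a 3D array: 2 x 2 x 3 = 12 elements
cube = arr.reshape(2, 2, 3)
print(cube.shape)  # (2, 2, 3)

# a shape that does not fit the number of elements raises a ValueError
try:
    arr.reshape(5, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```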
Flattening
Flattening is used for converting a multidimensional array to a 1D array. Let's illustrate with an example as follows.
#First we create the matrix we'll be using
matrix = np.array([[2,3,4], [5,6,7]])
#flattening the matrix
np.ravel(matrix)
#second method
matrix.flatten()
#output
array([2, 3, 4, 5, 6, 7])
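Both methods give the same values, but there is a difference behind the scenes: flatten() always returns a copy, while np.ravel() returns a view of the original data whenever it can. The sketch below shows what that means in practice when you change the flattened result.

```python
import numpy as np

matrix = np.array([[2, 3, 4], [5, 6, 7]])

flat_copy = matrix.flatten()  # always a new, independent array
flat_view = np.ravel(matrix)  # a view of the same data when possible

print(np.shares_memory(matrix, flat_copy))  # False
print(np.shares_memory(matrix, flat_view))  # True

# changing the copy leaves the original untouched
flat_copy[0] = 99
print(matrix[0, 0])  # 2

# changing the view changes the original as well
flat_view[0] = 99
print(matrix[0, 0])  # 99
```

So reach for flatten() when you want to modify the result safely, and np.ravel() when you only need to read it and want to avoid the extra copy.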
Sorting and Searching
Sorting
np.sort() returns a sorted copy of an array, while np.argsort() returns the indices that would sort the array.
arr = np.array([3,1,1,2,6,3,9,4,8,7,6,1,2])
#sorting the Array
np.sort(arr)
#output: array([1, 1, 1, 2, 2, 3, 3, 4, 6, 6, 7, 8, 9])
#sorting the array in terms of indices
np.argsort(arr)
#output : array([ 1, 2, 11, 3, 12, 0, 5, 7, 4, 10, 9, 8, 6], dtype=int64)
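Where np.argsort() really shines is when you want to reorder one array according to the values in another. In the sketch below the names and scores are made-up sample data.

```python
import numpy as np

# hypothetical sample data: each name lines up with a score
names = np.array(['Ada', 'Bola', 'Chidi'])
scores = np.array([72, 95, 58])

# indices that would sort the scores from lowest to highest
order = np.argsort(scores)
print(order)  # [2 0 1]

# use the same indices to reorder both arrays together
print(scores[order])  # [58 72 95]
print(names[order])   # ['Chidi' 'Ada' 'Bola']
```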
Searching
np.where() returns the indices where a specified criterion is met, while np.searchsorted() is used on sorted arrays to find the index where a value can be inserted so that the array maintains its sorted state. See the code block below for clarification.
#creating the array we'll be using
arr = np.array([3,5,7,9,7,4,5,5,7,2])
np.where(arr > 5)
#output:(array([2, 3, 4, 8], dtype=int64),)
#np.searchsorted() assumes a sorted array, so we sort ours first
np.searchsorted(np.sort(arr), 5)
#output : 3
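Because np.searchsorted() expects its input to be sorted already, the result is only meaningful on a sorted array. The sketch below sorts the array first and also shows the side parameter, which controls whether the new value goes before or after any equal values that are already there.

```python
import numpy as np

arr = np.array([3, 5, 7, 9, 7, 4, 5, 5, 7, 2])
sorted_arr = np.sort(arr)
print(sorted_arr)  # [2 3 4 5 5 5 7 7 7 9]

# default side='left': insert before the existing 5s
left = np.searchsorted(sorted_arr, 5)
print(left)  # 3

# side='right': insert after the existing 5s
right = np.searchsorted(sorted_arr, 5, side='right')
print(right)  # 6

# inserting at the returned index keeps the array sorted
inserted = np.insert(sorted_arr, left, 5)
print(inserted)  # [2 3 4 5 5 5 5 7 7 7 9]
```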
Data Manipulation
Data manipulation means transforming, cleaning and structuring data to make it ready and easy to work with. This may take various forms, as we will see in this section.
Concatenation and Stacking
Concatenation
Concatenation means combining arrays. There are several ways to do that with NumPy, which include np.concatenate(), np.hstack() and np.vstack(). Let's practice with the example below.
#creating the arrays we'll be using
arr1 = np.array([[1,2,3], [4,5,6]])
arr2 = np.array([[5,6,7], [7,8,9]])
#concatenating them vertically
vert_concat = np.concatenate((arr1, arr2), axis=0 )
print(vert_concat)
#concatenating them horizontally
hor_concat = np.concatenate((arr1, arr2), axis=1)
print(hor_concat)
output:
vert_concat : array([[1, 2, 3],
[4, 5, 6],
[5, 6, 7],
[7, 8, 9]])
hor_concat :
array([[1, 2, 3, 5, 6, 7],
[4, 5, 6, 7, 8, 9]])
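np.vstack() and np.hstack() were mentioned above but not shown. For 2D arrays like ours they give the same results as np.concatenate() with axis=0 and axis=1 respectively, as this short sketch checks.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[5, 6, 7], [7, 8, 9]])

# stacking vertically (one on top of the other): 4 rows, 3 columns
v_stacked = np.vstack((arr1, arr2))
print(v_stacked.shape)  # (4, 3)

# stacking horizontally (side by side): 2 rows, 6 columns
h_stacked = np.hstack((arr1, arr2))
print(h_stacked.shape)  # (2, 6)

# same results as np.concatenate with the matching axis
print(np.array_equal(v_stacked, np.concatenate((arr1, arr2), axis=0)))  # True
print(np.array_equal(h_stacked, np.concatenate((arr1, arr2), axis=1)))  # True
```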
Splitting arrays
Splitting arrays is a versatile and fundamental operation when working with data. In the context of this series, it enables proper model evaluation, facilitates efficient processing of large datasets and supports various data exploration and analysis tasks.
Splitting Arrays into smaller arrays
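We can use np.split() or np.array_split() to achieve this. np.split() requires that the array divides into equal parts, while np.array_split() also allows parts of unequal size. Let's show that with examples.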
#First we create the array we'll be using
arr = np.array([1,9,4,7,6,2,4,1,9,4,0,8,2,5,4,4,8,9,3,0,2,7,4,6,3])
#splitting the array into 5 subarrays
np.split(arr, 5)
output:
[array([1, 9, 4, 7, 6]),
array([2, 4, 1, 9, 4]),
array([0, 8, 2, 5, 4]),
array([4, 8, 9, 3, 0]),
array([2, 7, 4, 6, 3])]
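The example above works because 25 elements divide evenly into 5 parts. When the array does not divide evenly, np.split() raises an error and np.array_split() is the one to use. A small sketch with a 10-element array, chosen only to illustrate the point:

```python
import numpy as np

arr = np.arange(10)  # 10 elements: 0 to 9

# 10 does not divide evenly by 3, so np.array_split makes
# the first part one element longer than the rest
parts = np.array_split(arr, 3)
print([len(part) for part in parts])  # [4, 3, 3]

# np.split insists on equal parts and raises a ValueError here
try:
    np.split(arr, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True
```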
Unique Values
We can find unique values in an array using np.unique(), for example:
arr = np.array([1,9,4,7,6,2,4,1,9,4,0,8,2,5,4,4,8,9,3,0,2,7,4,6,3])
np.unique(arr) #output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
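np.unique() can also tell us how often each value occurs if we pass return_counts=True. Here is a quick sketch on the same array, which also picks out the most frequent value.

```python
import numpy as np

arr = np.array([1,9,4,7,6,2,4,1,9,4,0,8,2,5,4,4,8,9,3,0,2,7,4,6,3])

# values holds the unique elements, counts how many times each one appears
values, counts = np.unique(arr, return_counts=True)
print(values)  # [0 1 2 3 4 5 6 7 8 9]
print(counts)  # [2 2 3 2 6 1 2 2 2 3]

# the value with the highest count
most_frequent = values[np.argmax(counts)]
print(most_frequent)  # 4
```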
Saving And Loading Data
Saving
We use np.save() for saving arrays to a file. It saves the array in .npy, a special binary format designed for NumPy arrays (the .npy extension is added automatically if we leave it off the file name). We can also use np.savetxt() to save it as a text file that is readable by human beings.
#lets create the array we want to save
arr= np.array([[2,6,4,8], [6,2,5,7]])
#saving it using the first method
np.save('saved_array.npy', arr)
#saving it using the second method
np.savetxt('saved_array.txt', arr)
Loading
np.load() is used to load a single array from a .npy file, while np.loadtxt() is used to load arrays from text files.
loaded_array = np.load('saved_array.npy')
print(loaded_array)
#or
loaded_array_txt = np.loadtxt('saved_array.txt')
print(loaded_array_txt)
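One difference between the two formats is worth seeing for yourself: the .npy file remembers the data type of the array, while the text file is read back as floats by default. The sketch below does a full round trip; it writes into a temporary folder (my choice, so no files are left behind) instead of the current directory.

```python
import os
import tempfile

import numpy as np

arr = np.array([[2, 6, 4, 8], [6, 2, 5, 7]])

# a temporary folder that is deleted automatically afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, 'saved_array.npy')
    txt_path = os.path.join(tmp_dir, 'saved_array.txt')

    np.save(npy_path, arr)
    np.savetxt(txt_path, arr)

    from_npy = np.load(npy_path)
    from_txt = np.loadtxt(txt_path)

# the binary format keeps the original integer type
print(from_npy.dtype == arr.dtype)  # True

# the text format comes back as floats by default
print(from_txt.dtype)  # float64

# the values themselves are the same either way
print(np.array_equal(from_npy, arr), np.array_equal(from_txt, arr))  # True True
```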
In Conclusion
In this episode we have:
Been introduced to arrays, the core data structure in NumPy, and learnt how to create them and carry out many operations on them.
Looked at the random module, where we learnt how to create and use random numbers.
Carried out some linear algebra operations such as matrix operations and finding eigenvalues and eigenvectors.
Learnt some advanced NumPy topics like reshaping and flattening arrays, sorting and searching.
Touched on the following data manipulation operations: concatenation, stacking, splitting and unique values.
Last but not least, learnt how to save and load arrays.
Phew! That was a lot! You should be really proud of yourself and take a moment to celebrate, but before that, see the last section below.
Next Steps
Practice! Practice!! Practice!!! Get your hands on as many practice exercises as possible, as this is the best way to learn. Remember, you lose what you don't use.
Like we said in the previous episode, don't wait until you feel you've learnt everything. Always strive to implement what you've learnt. You will run into challenges that will require you to do some research, and this is very normal even for those with experience. It's your determination during such periods that builds you into the professional you are aiming to become.
You can use Google, YouTube and ChatGPT to carry out your research. You can also refer to the official NumPy documentation or the official NumPy quickstart tutorial.
One last thing
It's time to take a moment to celebrate your milestone of mastering NumPy. When you are done, check chapter 3 of this series where we'll be discussing all about Pandas. Until then cheers to you!
Hi, I am Ajoke and I am a data specialist proficient in the use of Python, SQL, Excel and Tableau. If you need help with data-related technical writing, training or analytics, you can reach me via email at ajokearegbeyen@gmail.com.