Hopefully you're convinced of the utility of using Python for data analysis. It is a powerful language that is easy to read, easy to understand, and easy to program in. Even if this is your very first programming language, it's a great place to start.
This is not a course for learning programming, however. Some light programming is required, but remember, our goal is to gain intelligence about data. The value of data is the information contained within it. The data by itself is useless unless we know how to extract useful knowledge from it.
In this chapter, we will introduce a few simple packages for working with your data. A package is a set of tools that extends Python's capabilities. Rather than making the task of programming more complex, packages actually simplify it, because for most of the tasks you would wish to do in Python, somebody else has already written a package that does most of the heavy lifting for you.
NumPy is used to create and manipulate arrays and matrices. If you are unfamiliar with arrays and matrices, think of columns, rows, and tables in a spreadsheet. Most of the data you could ever want to analyse can be represented as a column, a row, or a table in a spreadsheet. So you need some tools to be able to work with this kind of data. Enter NumPy.
# Let's first import the NumPy package, and represent it as the object 'np'
import numpy as np
# Let's create an 'array' with numpy (note the np.-prefix) and call that 'data'
data = np.array([0.5, 1.2, 2.2, 3.4, 3.5, 3.4, 3.4, 3.4])
So now we've created an array called data. What can we know about this array?
# What is the 'shape' of this array
print(data.shape)
I.e., the array has only one dimension and it is 8 values long. The expression data.shape, with the period, means you are accessing the array called "data" and asking for its shape attribute. The shape attribute is attached to every NumPy array.
# Can we create a 2-D dataset?
random_data = np.random.random([3,2])
print(random_data)
What we have done here is create a "matrix" (think table) of values, each value chosen to lie randomly somewhere between 0 and 1. The matrix has 3 rows and 2 columns. Let's ask Python for its shape.
print(random_data.shape)
It prints a pair of values, the first one giving the number of rows, and the second giving the number of columns.
What if we wanted to know the total number of values in our table? For that we'd use the size attribute.
print(random_data.size)
Python tells us that the matrix called random_data has a total of 6 values, one for each cell in each row and column.
What if we wanted to know some very basic statistics about our matrix, such as what the maximum and minimum values were, as well as the average?
print('Max: ', random_data.max())
print('Min: ', random_data.min())
The parentheses () are required when calling a function that is attached to a dataset (a method). Here we've applied the NumPy methods max and min to the random_data table. The reason we didn't need parentheses when we asked for the shape or the size of our matrix is that shape and size are stored as attributes of the data structure; they are values, not functions being applied to it. It's a technicality, which you'd care more about if you were studying programming. But we're just interested in doing some basic data analysis right now.
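To make the distinction concrete, here is a minimal sketch reusing the random_data array from above: attributes like shape and size are looked up without parentheses, while methods like max must be called with them.
# Attributes are stored values: no parentheses
print(random_data.shape)   # e.g. (3, 2)
print(random_data.size)    # e.g. 6
# Methods are functions attached to the array: parentheses are required
print(random_data.max())   # the largest value in the array
# Forgetting the parentheses returns the method itself, not a number
print(random_data.max)     # e.g. <built-in method max of numpy.ndarray object at 0x...>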
Right, we were going to compute the average too, so let's do that.
print('The mean of all values in random_data: ', random_data.mean())
I'm going to stop distinguishing between matrices and arrays from now on. A matrix is just a 2-dimensional array. Python can handle n-dimensional arrays, so everything comes down to being an array. In practice, most arrays are just 1-, 2-, or 3-dimensional. I've worked with higher-dimensional ones before, but they can get harder to wrap your head around. Don't worry about them too much for now. I'm going to call everything an array from now on.
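As a small illustration of that point (this snippet uses np.zeros, which otherwise isn't needed in this chapter), a 3-dimensional array works with exactly the same shape and size attributes we've already seen:
# A 3-D array: think of it as two stacked tables, each with 3 rows and 4 columns
stack = np.zeros((2, 3, 4))
print(stack.shape)   # (2, 3, 4)
print(stack.size)    # 24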
Sometimes you need to create arrays in Python, and NumPy is the tool of choice in data analysis. Let's see a few examples.
You're examining some historical data and you need a helpful array to carry the years between 1850 and 1917. These should be whole numbers (no decimal points) and there should be one for every year.
years = np.arange(1850, 1917, dtype=int)
print(years)
The arange command creates an array over the range of values that you declare when calling the NumPy function. Notice how I also declared dtype=int. That means I want the "data type" to be integers (int) instead of, say, floating-point numbers (float).
arange works across a "semi-open" interval, meaning that 1850 was included in the resulting array, but the end-point 1917 was not. If I wanted the year 1917 included in my array, I'd have to specify np.arange(1850, 1918, dtype=int). Just be aware of this.
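A quick check makes the difference easy to see:
# The stop value 1917 is excluded...
print(np.arange(1850, 1917, dtype=int)[-1])   # 1916
# ...so extend the stop value by one to include it
print(np.arange(1850, 1918, dtype=int)[-1])   # 1917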
How many years are included in your array? Let's ask Python:
print(years.size)
67 years in my range.
This is only a slight modification of the previous example. If we care only about 5-year intervals, we can create the same array but declare the step size to be 5 years. This is how we'd do that:
years = np.arange(1850, 1917, 5, dtype=int)
print(years)
Let's do a quick test! You want to use a random number generator in a program that generates a "coin flip" for you. But you need to make sure that your random number generator is giving you equal amounts of "heads" and "tails". So let's generate a large array of random numbers and do some counting.
# Make an array of random numbers, 10000 long
n_flips = 10000
random_numbers = np.random.randint(0, 2, size=n_flips)
# Show the first 100 flips
random_numbers[:100]
# Heads: Create an array of all the times the random number was 0
heads = np.array([x for x in random_numbers if x == 0])
# Tails: Create an array of all the times the random number was 1
tails = np.array([x for x in random_numbers if x == 1])
# Print the first 10 numbers in `heads`
print(heads[0:10])
# Print the first 10 numbers in `tails`
print(tails[0:10])
Now let's get back to the main question. I generated 10,000 random numbers in a large array, then chopped them up into two piles. If the random number generator worked right, it should have distributed the values roughly evenly between 0s ("heads") and 1s ("tails"). Let's count them:
print('Heads : ', heads.size)
print('Tails : ', tails.size)
# One standard deviation
sigma = np.sqrt(n_flips/4)
print(sigma)
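That sigma is the standard deviation you'd expect for the number of heads in n_flips fair coin flips: the count follows a binomial distribution with p = 0.5, whose standard deviation is sqrt(n*p*(1-p)) = sqrt(n/4). So for 10,000 flips the head count should typically land within a couple of sigma, i.e. within roughly 100, of 5,000. As an aside, here is a more compact, vectorized way to do the same counting with np.count_nonzero, plus a check of how far our split is from perfectly even (just an alternative sketch, not a replacement for the list-comprehension approach above):
# Count heads and tails directly, without building new arrays
n_heads = np.count_nonzero(random_numbers == 0)
n_tails = np.count_nonzero(random_numbers == 1)
# How far are we from a perfect 50/50 split, measured in standard deviations?
deviation = abs(n_heads - n_flips / 2)
print('Deviation from an even split (in sigmas): ', deviation / sigma)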
Try repeating for n_flips = 1000000
Data visualization is a skill equal in importance to data analysis. As a data scientist, you must not only be able to wrangle the data into your machinery for processing, you must also be able to communicate the results of your data. We want to communicate clearly and effectively, which means we now need to start learning about how to put our data in graphs and figures.
Data visualization combines practical skills with creative design skills. These skills are being employed by journalists and news editors to communicate complex ideas. Scientists use data visualization to understand their experiments better. It is a form of exploratory data analysis and one of the best ways to get a handle on what the data is saying.
For some examples of how data visualization can powerfully communicate ideas, visit the visualization gallery at Information is Beautiful.
There are many tools that you can use to visualize data. Some are easier to use than others. Most people will have had some experience making bar or pie charts in Microsoft Excel, but sometimes you need to make more sophisticated plots, or creatively combine various kinds of graphs to communicate an idea. As a data scientist, you will eventually have to use many different kinds of tools. Many will involve programming. Some, like Tableau, allow the user to manipulate the data via a graphical user interface (GUI). It's important to be flexible with different techniques.
The matplotlib library built for Python contains a large array of tools for making beautiful visualizations. You can see some examples in their visual gallery.
Because matplotlib allows for simple plotting within Python, we can easily combine our analysis code with our visualization code without having to switch between applications or programming languages.
And for users who are familiar with drawing plots in MATLAB, the syntax is actually quite similar, so you should be up and running in a very short amount of time.
Some of the simplest plots you will make might be from some data and arrays that you generate yourself.
Suppose you hear somewhere that the average IQ score is 100 and the standard deviation is 15 points. Notwithstanding all of the challenges surrounding psychometrics, let's just run with this notion for now. If we assume that intelligence follows a normal distribution (the familiar "bell curve"), then the distribution of intelligence within a population should be described by the function
$$ p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left(-\frac{(x-\mu)^2}{2 \sigma^2}\right) $$

where $p(x)$ is the probability density of finding an IQ score of $x$ in the population, $\mu$ is the mean or average value, and $\sigma$ is the standard deviation.
Let's plot this function in Python using matplotlib.
The cell below contains a "magic command" in Jupyter: %matplotlib inline. This gives the special instruction to Jupyter to display the plot generated by matplotlib directly beneath the code cell, as inline output. You need to make sure you add %matplotlib inline to every Jupyter Notebook that makes use of matplotlib.
%matplotlib inline
import numpy as np # We're still going to need NumPy to handle our arrays
import matplotlib.pyplot as plt # Pyplot is a submodule of matplotlib that we'll need to do plotting
avgIQ = 100
stdev = 15
def IQfunc(x):
    '''
    Returns the probability density in a population of individuals for having an IQ of value x
    '''
    IQ = 1/(stdev*np.sqrt(2 * np.pi)) * np.exp(- (x-avgIQ)**2 / (2*stdev**2))
    return IQ
# Create an array with the range of possible IQ values, say from 0 to 150
IQs = np.linspace(0, 150, 200)
density = IQfunc(IQs)
# Plot the population density against the range of possible IQs
plt.plot(IQs,density)
# Turn on a grid
plt.grid(True, alpha=0.3)
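# Replot, this time adding axis labels and a title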
plt.plot(IQs, density)
plt.xlabel('IQ')
plt.ylabel('Normalized Density')
plt.title('IQ Distribution')
plt.grid(True, alpha=0.3)
Hmm, OK, now what about modifying the font sizes, line width, and adding some annotation?
plt.plot(IQs, density, linewidth=2)
plt.xlabel('IQ', fontsize=15)
plt.ylabel('Normalized Density', fontsize=15)
plt.title('IQ Distribution', fontsize=17)
# Add text at data coordinates (10,0.021). Use \n to add a line break in the text string
plt.text(10,0.021,'Average IQ: 100\nStandard Deviation: 15', fontsize=12)
# Draw a grid
plt.grid(True, alpha=0.3)
# Shade the region underneath the curve at 20% opacity
plt.fill(IQs, density, alpha=0.2)
# Save the figure to a file
plt.savefig('iq_distribution.png')
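If the saved image comes out cropped or too coarse for your needs, savefig accepts a couple of handy keyword arguments; the filename below is just an example.
# Higher resolution, with the bounding box tightened around the figure
plt.savefig('iq_distribution_hires.png', dpi=300, bbox_inches='tight')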