Part I: Introduction to Python #
Why do you need python for data analysis?#
Easy to learn: clean and simple, easy to read, intuitive. You will start writing code in about a minute.
Reproducibility: writing automated scripts to analyse data ensures reproducibility.
Versatile and extensible: a large number of libraries for scientific computing and data analysis are available, such as numpy, scipy, matplotlib and pandas, along with modules for data science.
However, Python is usually much slower than compiled languages such as C++ and Fortran.
The Jupyter Notebook#
interactive editor for Python (and other languages)
Some useful shortcuts#
Enter on a cell to edit it
Esc to stop editing
Alt+Enter to execute cell
A to insert cell above
B to insert cell below
Typing code#
# You can put comments by putting "#" first
You can print with,
print("Hello, World!")
Indentation is important!
This works -
if 5 > 2:
    print("Five is greater than two!")
This does not -
if 5 > 2:
print("Five is greater than two!")
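The difference between the two variants can be checked without stopping the notebook. The sketch below stores both versions as strings and asks Python to compile them; the helper name compiles is only illustrative and not part of the original material.

```python
# Two versions of the same statement, stored as strings so that the
# broken one does not stop this cell from running.
indented = "if 5 > 2:\n    print('Five is greater than two!')"
not_indented = "if 5 > 2:\nprint('Five is greater than two!')"


def compiles(source):
    """Return True if Python accepts the source, False on a syntax error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(compiles(indented))      # True
print(compiles(not_indented))  # False
```

The second version fails because the body of an if statement must be indented relative to the if line.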
Variables in python#
Five commonly used built-in data types in Python are -
Number - integer, float, complex, boolean
string
list
tuple
dictionary
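Variables can have any name that starts with a letter or underscore, except for reserved Python keywords such as class, def and lambda. Names of built-ins such as int are not reserved, but reusing them hides the built-in and should be avoided.
Python infers the data type of a variable from the value assigned to it.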
# Examples
name = "Ada Lovelace"
a = 1
b = 1.23
c = True
d = type(c)
print(name, a, b, c, d)
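Since the type belongs to the value and not to the name, the same variable can be reassigned to a value of a different type. A minimal sketch (the variable names are only for illustration):

```python
value = 10
first_type = type(value)    # int

value = "ten"               # the same name now refers to a string
second_type = type(value)   # str

value = 10.0
third_type = type(value)    # float

print(first_type, second_type, third_type)
# <class 'int'> <class 'str'> <class 'float'>
```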
Task
Exchange the values of two variables
x = 5
y = 2
to,
x = 2
y = 5
Sequences#
Lists
numbers = [1, 2, 3, 4]
numbers = [1, 2, 3, 'four', True]
numbers[0]
numbers[-1]
len(numbers)
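Besides indexing single elements, lists support slicing and can grow. The following is a small sketch of these operations; the variable names first_two and last_two are only for illustration.

```python
numbers = [1, 2, 3, 4]

first_two = numbers[:2]    # slice: elements at index 0 and 1
last_two = numbers[-2:]    # slice: the last two elements

numbers.append(5)          # add an element at the end

print(first_two, last_two, numbers, len(numbers))
# [1, 2] [3, 4] [1, 2, 3, 4, 5] 5
```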
Tuples
numbers = (1, 2, 3, 4)
numbers = (1, 2, 3, 'four', True)
Task
What is the difference between lists and tuples?
a_list = [1,2,3,4]
a_tuple = (1,2,3,4)
Try to change the first element of both to 'one'
a_list[0] = 'one'
Operations#
Arithmetic operations: +, -, *, /, %, **, // #
a = 2
b = 3
a+b, a-b
a*b, a**b
a/b, a//b
a%b
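The two division operators and the remainder operator are easy to mix up. The sketch below uses separate names p and q (chosen here only so that a and b keep their values) and shows how floor division and remainder fit together.

```python
p = 7
q = 2

true_division = p / q     # 3.5, always a float
floor_division = p // q   # 3, rounded down to a whole number
remainder = p % q         # 1, what is left after floor division
power = p ** q            # 49, p raised to the power q

# floor division and remainder reconstruct the original number
reconstructed = floor_division * q + remainder

print(true_division, floor_division, remainder, power, reconstructed)
# 3.5 3 1 49 7
```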
Comparison operators - ==, !=, >, <, >=, <= #
a == 2
Logical and identity operators - in, not in, is, is not, and, or, not #
a in [1, 2, 3, 4]
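These operators return True or False and can be combined. A short sketch with made-up values (values and n are names used only here):

```python
values = [1, 2, 3, 4]
n = 3

is_member = n in values           # True: 3 is in the list
is_missing = 7 not in values      # True: 7 is not in the list
in_range = (n > 1) and (n < 3)    # False: the second condition fails
either = (n == 3) or (n == 10)    # True: the first condition holds
negated = not is_member           # False

print(is_member, is_missing, in_range, either, negated)
# True True False True False
```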
Task
What is the difference between "==" and "is"?
a = 500
b = 500
Check the values of "a == b" and "a is b"
Dictionaries#
A dictionary can be used for storing heterogeneous data as key-value pairs: each key maps to a corresponding value. It can be defined by
person = {"name": "Maria", "age": 34, "telephone": 23458991 }
person.keys()
person.items()
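Values are looked up by key, and new entries can be added after the dictionary is created. The sketch below reuses the person dictionary; the "city" entry and the "email" key are hypothetical additions for illustration only.

```python
person = {"name": "Maria", "age": 34, "telephone": 23458991}

name = person["name"]                    # look up a value by its key
person["city"] = "Berlin"                # add a new entry (made-up value)
email = person.get("email", "unknown")   # default if the key is missing

# loop over all key-value pairs
for key, value in person.items():
    print(key, value)
```

Using get avoids the KeyError that person["email"] would raise for a missing key.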
Control flows#
if, else and elif
x = 3.5
if x <= 1.0:
    print("low")
elif (x > 1.0) and (x < 3.0):
    print("average")
elif (x >= 3.0):
    print("high")
else:
    print("invalid")
for loops
x = [1,2,3,4,5]
for i in x:
    y = i**2 + 3
    print(y)
Controlling loops with break and continue
Find the first three even numbers below 10
even_numbers = []
for n in range(1, 10):
    # skip odd numbers
    if (n%2) != 0:
        continue
    even_numbers.append(n)
    if len(even_numbers) == 3:
        break
even_numbers
Task
Write a loop which calculates the first 6 terms of the Fibonacci sequence
term1 = 0
term2 = 1
for x in range..
Functions#
Functions are an integral part of any programming language. A function is used to take some values, do a task and return the required information.
def add_numbers(x, y):
    """
    adds two numbers
    """
    s = x+y
    return s
add_numbers(2, 3)
The above function takes two values as input, x and y. They are the arguments of the function. It returns a calculated value s, which is the return value. A function can also return no value at all, or have keyword arguments with default values.
def add_numbers(x, y=3):
    """
    adds two numbers
    """
    s = x+y
    return s
add_numbers(2), add_numbers(2, y=4)
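A function without a return statement still hands something back to the caller, namely None. The function greet below is a hypothetical example, not part of the original material.

```python
def greet(name, greeting="Hello"):
    """
    prints a greeting; there is no return statement
    """
    print(greeting + ", " + name + "!")


result = greet("Ada")              # prints: Hello, Ada!
print(result)                      # None
greet("Ada", greeting="Welcome")   # prints: Welcome, Ada!
```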
Task
Convert the code you wrote for the Fibonacci sequence to a function
Classes#
A class defines a type of object and attaches attributes and functions to it. Let us build a very simple example. A circle can be defined by its radius, from which other attributes such as circumference and area can be derived.
Before we start, it is beneficial to think about what attributes a class Circle might/should have. You can have class attributes and associated functions. A good example of an attribute is radius. An associated function could be get_area, which would calculate the area of the circle since we already know the radius.
class Circle:
    """
    Circle class to hold properties of a circle
    """
    def __init__(self, radius=None):
        self.radius = radius
        #other variables are set to None
        self.area = None
        self.circumference = None
    def get_area(self):
        """
        Calculate area
        """
        self.area = 3.14 * self.radius**2
    def get_circumference(self):
        """
        Calculate circumference
        """
        self.circumference = 2.0 * 3.14 * self.radius
small_circle = Circle(radius=8)
big_circle = Circle(radius=24)
Calculate the area and circumference of the Circle
small_circle.get_area()
small_circle.get_circumference()
You can access the attributes of the class through its object, using a dot
small_circle.area
small_circle.circumference
small_circle.radius
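One possible variation, shown here only as a sketch, is to let the functions return the calculated values instead of storing them, and to use math.pi from the standard library instead of 3.14. The class name CircleWithReturn is hypothetical and chosen so that it does not replace the Circle class above.

```python
import math


class CircleWithReturn:
    """
    Hypothetical variant of Circle whose functions return their results
    """
    def __init__(self, radius):
        self.radius = radius

    def get_area(self):
        """
        Calculate and return the area
        """
        return math.pi * self.radius**2

    def get_circumference(self):
        """
        Calculate and return the circumference
        """
        return 2.0 * math.pi * self.radius


circle = CircleWithReturn(radius=8)
area = circle.get_area()
circumference = circle.get_circumference()
print(round(area, 2), round(circumference, 2))   # 201.06 50.27
```

Returning the value means it cannot be read before it has been calculated, which can happen with attributes that start out as None.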
Using libraries#
A major strength of the Python ecosystem is its libraries. A large number of libraries are available, which you can import and use.
Numpy#
Numpy offers array data structures and a large collection of numerical tools used across science and statistics
import numpy as np
Numpy arrays are faster and easier to handle than lists for numerical work - but they should be homogeneous, that is, all elements should have the same data type!
A = np.array([1,2,3,4,5,6])
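The data type shared by all elements of an array is stored in its dtype attribute. The sketch below (with illustrative names ints and mixed) suggests what happens when integers and floats are combined in one array.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # integers and a float together

print(ints.dtype)    # an integer type; the exact size depends on the platform
print(mixed.dtype)   # float64: the integers were converted to floats
print(mixed)
```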
Mathematical operations on numpy arrays
Mathematical operations on numpy arrays are different from those on lists. They are vectorized.
A = np.ones(3)
A
B = A + A
B
The individual elements are added, so the arrays need to have the same length
C = np.sqrt(B)
C
D = B*B
D
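To see how this differs from lists, the same operator can be applied to a list and to an array with the same contents. A minimal sketch; the names plain_list and an_array are only for illustration.

```python
import numpy as np

plain_list = [1, 2, 3]
an_array = np.array([1, 2, 3])

list_sum = plain_list + plain_list   # lists are joined end to end
array_sum = an_array + an_array      # arrays are added element by element
array_product = an_array * an_array  # element by element as well

print(list_sum)        # [1, 2, 3, 1, 2, 3]
print(array_sum)       # [2 4 6]
print(array_product)   # [1 4 9]
```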
Pandas#
Pandas, the Python Data Analysis Library, is one of the most useful tools for working with tabular data in python. The central object in pandas is the DataFrame, a 2-dimensional data structure whose columns can store data of different types.
import pandas as pd
Use pandas to read a csv file
df = pd.read_csv("dax-ti-static.csv")
df
Some useful commands
df.head(), df.tail(), df.columns, df.shape
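If the csv file is not at hand, the same commands can be tried on a small DataFrame built in memory. The table below is entirely made up (the names toy, label and value are assumptions and not related to the csv file); it only illustrates how the calls behave.

```python
import pandas as pd

# made-up data, only to illustrate the calls
toy = pd.DataFrame({
    "label": ["a", "b", "c"],
    "value": [10.0, 30.0, 20.0],
})

print(toy.shape)      # (3, 2): three rows, two columns
print(toy.columns)

largest = toy["value"].max()       # largest entry of one column
row = toy["value"].idxmax()        # index of the row holding that entry
label = toy.iloc[row]["label"]     # another column of the same row

print(largest, row, label)         # 30.0 1 b
```

Here idxmax returns an index label, which matches the row position used by iloc only because the DataFrame has the default 0, 1, 2, ... index.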
Task
Can you guess what data is in the csv file that you read in?
df['YM'].max()
df['YM'].min()
df['YM'].idxmax()
df.iloc[131]['formula']
Further reading..