Part I: Introduction to Python #

Why do you need python for data analysis?#

  • Easy to learn: clean, simple, easy to read and intuitive. You will start writing code in about a minute.

  • Reproducibility: writing automated scripts to analyse data ensures reproducibility.

  • Versatile and extensible: a large number of useful libraries for scientific computing and data analysis, such as numpy, scipy, matplotlib and pandas, along with other modules for data science.

  • However, pure Python code is usually much slower than compiled languages such as C++ and Fortran.

The Jupyter Notebook#

  • interactive editor for Python (and other languages)

Some useful shortcuts#

  • Enter on a cell to edit it

  • Esc to stop editing

  • Shift+Enter to execute a cell (Alt+Enter executes it and inserts a new cell below)

  • A to insert cell above

  • B to insert cell below

Typing code#

# You can put comments by putting "#" first

You can print text with the print function:

print("Hello, World!")

Indentation is important!

This works -

if 5 > 2:
    print("Five is greater than two!")

This does not, because the line after the if statement is not indented -

if 5 > 2:
print("Five is greater than two!")
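To see what "does not work" means in practice, the short sketch below compiles both snippets from strings, so that the failing one does not stop the rest of the code. The variable names (good, bad, error_name) are made up for this illustration.

```python
# Compile two small snippets from strings to see how Python treats indentation.
good = 'if 5 > 2:\n    print("Five is greater than two!")\n'
bad = 'if 5 > 2:\nprint("Five is greater than two!")\n'

# The indented version compiles without complaint.
compile(good, "<good>", "exec")

# The unindented version is rejected before any code runs.
try:
    compile(bad, "<bad>", "exec")
    error_name = None
except IndentationError as err:
    error_name = type(err).__name__

print(error_name)
```

The error is raised when the code is parsed, so no line of the badly indented snippet is ever executed.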

Variables in python#

Commonly used built-in data types in Python include -

  • Number - integer, float, complex, boolean

  • string

  • list

  • tuple

  • dictionary

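Variable names can contain letters, digits and underscores, but cannot start with a digit and cannot be a reserved Python keyword such as class, def or lambda. Names of built-ins such as int are not reserved, but reusing them hides the built-in and is best avoided.

Python is dynamically typed: the type of a variable is inferred from the value assigned to it.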

# Examples
name = "Ada Lovelace"
a = 1
b = 1.23
c = True
d = type(c)

print(name, a, b, c, d)
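As a small sketch of how types follow values, the example below assigns two different kinds of value to the same name and checks two names against the list of reserved keywords using the standard keyword module. The variable names are chosen only for this illustration.

```python
import keyword

# The same name can hold values of different types over time.
value = 1
first_type = type(value).__name__
value = "one"
second_type = type(value).__name__
print(first_type, second_type)

# keyword.iskeyword tells whether a name is reserved by the language.
class_is_reserved = keyword.iskeyword("class")
int_is_reserved = keyword.iskeyword("int")
print(class_is_reserved, int_is_reserved)
```

The full list of reserved words for your Python version is available as keyword.kwlist.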

Task

Exchange the values of two variables

x = 5
y = 2

to,

x = 2
y = 5

Sequences#

Lists

numbers = [1, 2, 3, 4]
numbers = [1, 2, 3, 'four', True]
numbers[0]
numbers[-1]
len(numbers)

Tuples

numbers = (1, 2, 3, 4)
numbers = (1, 2, 3, 'four', True)

Task

What is the difference between lists and tuples?

a_list = [1,2,3,4]
a_tuple = (1,2,3,4)

Try to change the first element of both to 'one':

a_list[0] = 'one'
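One possible way to explore this task is sketched below. The assignment to the tuple is wrapped in try/except so the error can be inspected instead of stopping the code; the name tuple_error is introduced only for this illustration.

```python
a_list = [1, 2, 3, 4]
a_tuple = (1, 2, 3, 4)

# Lists are mutable: item assignment changes the list in place.
a_list[0] = "one"

# Tuples are immutable: item assignment raises a TypeError.
try:
    a_tuple[0] = "one"
    tuple_error = None
except TypeError as err:
    tuple_error = type(err).__name__

print(a_list)
print(a_tuple)
print(tuple_error)
```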

Operations#

Arithmetic operations: +, -, *, /, %, **, //#

a = 2
b = 3
a+b, a-b
a*b, a**b
a/b, a//b
a%b
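The difference between /, // and % is easiest to see with numbers that do not divide evenly. A minimal sketch, with values chosen only for this illustration:

```python
a = 7
b = 2

true_division = a / b    # always gives a float
floor_division = a // b  # rounds down to a whole number
remainder = a % b        # what is left after floor division
power = a ** b           # a raised to the power b

print(true_division, floor_division, remainder, power)

# Floor division and remainder fit together to give back the original number.
print(floor_division * b + remainder == a)
```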

Comparison operators - ==, !=, >, <, >=, <=#

a == 2

Logical and identity operators - in, not in, is, is not, and, or, not#

a in [1, 2, 3, 4]

Task

What is the difference between "==" and "is"?

a = 500
b = 500

Check the values of "a == b" and "a is b".
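The sketch below illustrates the distinction with lists rather than integers. For integers such as 500, the result of "is" depends on implementation details of the interpreter, whereas two separately created lists are always distinct objects. The names first, second and alias are made up for this example.

```python
first = [1, 2, 3]
second = [1, 2, 3]
alias = first  # a second name for the same object

same_value = first == second   # compares contents
same_object = first is second  # compares identity
alias_is_first = alias is first

print(same_value, same_object, alias_is_first)
```

In short, "==" asks whether two objects have equal values, while "is" asks whether two names refer to the very same object.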

Dictionaries#

A dictionary stores heterogeneous data as key-value pairs: each key maps to a corresponding entry. It can be defined by

person = {"name": "Maria", "age": 34, "telephone": 23458991 }
person.keys()
person.items()
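A short sketch of common dictionary operations follows: looking up an entry by its key, adding and updating entries, and looping over the key-value pairs. The "city" entry and its value are hypothetical additions for this illustration.

```python
person = {"name": "Maria", "age": 34, "telephone": 23458991}

# Look up an entry by its key.
name = person["name"]

# Add a new entry (hypothetical data) and update an existing one.
person["city"] = "Berlin"
person["age"] = 35

# get returns None instead of raising an error when the key is missing.
email = person.get("email")

# Loop over keys and entries together.
for key, value in person.items():
    print(key, value)
```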

Control flows#

if, else and elif

x = 3.5
if x <= 1.0:
    print("low")
elif (x > 1.0) and (x < 3.0):
    print("average")
elif (x >= 3.0):
    print("high")
else:
    print("invalid")

for loops

x = [1,2,3,4,5]
for i in x:
    y = i**2 + 3
    print(y)

Controlling loops with break and continue

Find the first three even numbers below 10

even_numbers = []
for n in range(1, 10):
    # skip odd numbers
    if (n%2) != 0:
        continue
    even_numbers.append(n)
    if len(even_numbers) == 3:
        break
even_numbers

Task

Write a loop which calculates the first 6 terms of the Fibonacci sequence

term1 = 0
term2 = 1

for x in range..

Functions#

Functions are an integral part of any programming language. A function is used to take some values, do a task and return the required information.

def add_numbers(x, y):
    """
    adds two numbers
    """
    s = x+y
    return s

add_numbers(2, 3)

The above function takes two values as input, x and y. They are the arguments of the function. It returns a calculated value s, which is the return value. A function can also return no value at all (in which case Python returns None), or have keyword arguments with default values, such as y=3 below.

def add_numbers(x, y=3):
    """
    adds two numbers
    """
    s = x+y
    return s

add_numbers(2), add_numbers(2, y=4)
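As a sketch of a function without a return statement, the hypothetical greet function below only prints. Calling it still produces a value, namely None.

```python
def greet(name, greeting="Hello"):
    """
    prints a greeting and returns nothing
    """
    print(greeting + ", " + name + "!")


# The default greeting is used when the keyword argument is left out.
result = greet("Ada")

# The keyword argument can be overridden by name.
greet("Ada", greeting="Welcome")

print(result)
```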

Task

Convert the code you wrote for the Fibonacci sequence into a function

Classes#

A class defines a type of object and attaches attributes and functions to it. Let us build a very simple example. A circle can be defined by its radius, from which other attributes such as circumference and area can be derived.

Before we start, it is beneficial to think about what attributes a class Circle might or should have. You can have attributes and associated functions (methods). A good example of an attribute is radius. An associated function could be get_area, which would calculate the area of the circle since we already know the radius.

class Circle:
    """
    Circle class to hold properties of a circle
    """
    def __init__(self, radius=None):
        
        self.radius        = radius
        #other variables are set to None
        self.area          = None
        self.circumference = None
    
    def get_area(self):
        """
        Calculate area
        """
        self.area = 3.14 * self.radius**2

    def get_circumference(self):
        """
        Calculate circumference
        """
        self.circumference = 2.0 * 3.14 * self.radius

small_circle = Circle(radius=8)
big_circle = Circle(radius=24)

Calculate the area and circumference of the Circle

small_circle.get_area()
small_circle.get_circumference()

You can access the attributes through the object

small_circle.area
small_circle.circumference
small_circle.radius
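A possible variation is sketched below. It is not the design used above, only an alternative for comparison: the hypothetical CircleSketch class returns the computed values from its methods instead of storing them, and uses math.pi from the standard library in place of the rounded 3.14.

```python
import math


class CircleSketch:
    """
    Hypothetical variant of Circle whose methods return their results
    """
    def __init__(self, radius):
        self.radius = radius

    def get_area(self):
        """
        Calculate and return the area
        """
        return math.pi * self.radius**2

    def get_circumference(self):
        """
        Calculate and return the circumference
        """
        return 2.0 * math.pi * self.radius


small_circle = CircleSketch(radius=8)
area = small_circle.get_area()
circumference = small_circle.get_circumference()
print(round(area, 2), round(circumference, 2))
```

Returning the values avoids attributes that stay None until a method has been called, at the cost of recomputing them on every call.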

Using libraries#

The major strength of the Python ecosystem is its libraries. Beyond the standard library, a large number of third-party libraries can be installed, imported and used.

Numpy#

Numpy offers n-dimensional arrays and a lot of useful numerical tools for science and statistics

import numpy as np

Numpy arrays are faster and easier to handle than lists for numerical work - but they should be homogeneous, i.e. all elements should have the same data type!

A = np.array([1,2,3,4,5,6]) 
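The sketch below shows what homogeneous means in practice: every array has a single dtype, and when integers and floats are mixed numpy converts all elements to a common type. The array names are made up for this illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])

# An array of integers has an integer dtype (the exact size depends on the platform).
print(ints.dtype)

# Mixing integers and floats gives an array in which every element is a float.
print(mixed.dtype)
print(mixed)
```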

Mathematical operations on numpy arrays

Mathematical operations on numpy arrays are different from those on lists. They are vectorized.

A = np.ones(3)
A
B = A + A
B

The individual elements are added. The arrays must therefore have the same shape, or shapes that are compatible under numpy's broadcasting rules.

C = np.sqrt(B)
C
D = B*B
D
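To make the contrast with lists concrete, the sketch below applies the same operators to a list and to an array holding the same numbers. The names a_list and an_array are chosen only for this example.

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array([1, 2, 3])

# For lists, + concatenates and * repeats.
list_sum = a_list + a_list
list_twice = a_list * 2

# For arrays, + and * act on each element.
array_sum = an_array + an_array
array_twice = an_array * 2

print(list_sum, list_twice)
print(array_sum, array_twice)
```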

Pandas#

Pandas, or Python Data Analysis Library, is one of the most useful tools for working with tabular data in Python. The central object in pandas is the DataFrame, a 2-dimensional data structure whose columns can store data of different types.

import pandas as pd

Use pandas to read a csv file

df = pd.read_csv("dax-ti-static.csv")
df

Some useful commands

df.head(), df.tail(), df.columns, df.shape
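If the csv file is not at hand, the same commands can be tried on a small DataFrame built from a dictionary. The sketch below uses entirely made-up data; the column names "label" and "value" are hypothetical and unrelated to the csv file above.

```python
import pandas as pd

# Made-up data for illustration only.
df_small = pd.DataFrame({
    "label": ["A", "B", "C"],
    "value": [10.0, 30.0, 20.0],
})

shape = df_small.shape
columns = list(df_small.columns)

# Largest entry in a column, and the row index at which it occurs.
largest = df_small["value"].max()
row_of_largest = df_small["value"].idxmax()

# iloc selects a row by position; a column name then picks one entry from it.
label_of_largest = df_small.iloc[row_of_largest]["label"]

print(shape, columns)
print(largest, row_of_largest, label_of_largest)
```

Here the index is the default 0, 1, 2, so the label returned by idxmax can also be used as a position with iloc.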

Task

Can you guess what data the csv file you read in contains?

df['YM'].max()
df['YM'].min()
df['YM'].idxmax()
df.iloc[131]['formula']