Machine Learning with Python - Day 3

Cognizant, Bangalore
March, 2015
Jigsaw Academy

These live notes are available online at http://bit.ly/cognizant-py.

Working with Files

%%file three.txt
one
two
three

Writing three.txt

f = open("three.txt")

f.read()

'one\ntwo\nthree'

open("three.txt").read()

'one\ntwo\nthree'

print(open("three.txt").read())

one
two
three

open("three.txt").readlines()

['one\n', 'two\n', 'three']

for line in open("three.txt").readlines():
    print(line, end="")

one
two
three

for i, line in enumerate(open("three.txt").readlines()):
    print("Line", i, ":", line, end="")

Line 0 : one
Line 1 : two
Line 2 : three

for line in open("three.txt"):
    print(line, end="")

one
two
three

Q: What happens if we read the same file object twice?

f = open("three.txt")
f.read()

'one\ntwo\nthree'

f.read()

''

That is because the file pointer is at the end.

f.tell() # file offset at the end of the file

13

open("three.txt").tell() # file offset at the beginning of the file

0

f.seek(0) 
f.tell()

0

f.read()

'one\ntwo\nthree'

Problem: Write a program cat.py that takes a filename as argument and prints all the contents of the file.

$ python cat.py three.txt
one
two
three
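Here is one possible solution, as a minimal sketch. The helper name cat and the argument-count check are choices made for this example; they are not part of the problem statement.

```python
import sys

def cat(filename):
    """Returns the entire contents of the file as a string."""
    # the with block closes the file once the contents are read
    with open(filename) as f:
        return f.read()

# run only when called as: python cat.py filename
if __name__ == "__main__" and len(sys.argv) == 2:
    # end="" avoids printing an extra newline after the file contents
    print(cat(sys.argv[1]), end="")
```

Keeping the reading logic in a function makes it possible to reuse it from other programs later.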

Example: Word Count

Let's try to implement the Unix word count command wc in Python.

%%file numbers.txt
1 one
2 two
3 three
4 four
5 five

Writing numbers.txt

!wc numbers.txt

       5      10      34 numbers.txt

%%file wc.py
import sys

def linecount(f):
    return len(open(f).readlines())

def wordcount(f):
    return len(open(f).read().split())

def charcount(f):
    return len(open(f).read())

def main():
    f = sys.argv[1]
    print(linecount(f), wordcount(f), charcount(f), f)
    
main()

Overwriting wc.py

!python wc.py numbers.txt

5 10 34 numbers.txt

Problem: Write a program sumfile.py that takes a filename as command-line argument and prints the sum of all numbers in that file. It is assumed that the file has one number per line.

$ python sumfile.py one-to-ten.txt
55

Problem: Write a program head.py that takes a filename as command-line argument and prints the first five lines of the file.

$ python head.py one-to-ten.txt
1
2
3
4
5
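A possible head.py, sketched under the assumption that a helper function head with an optional line count n is acceptable (the problem only asks for the first five lines).

```python
import sys
from itertools import islice

def head(filename, n=5):
    """Returns the first n lines of the file as a list of strings."""
    with open(filename) as f:
        # islice stops after n lines, so the rest of the file is never read
        return list(islice(f, n))

# run only when called as: python head.py filename
if __name__ == "__main__" and len(sys.argv) == 2:
    for line in head(sys.argv[1]):
        print(line, end="")
```

If the file has fewer than five lines, head simply returns all of them.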

Problem: Write a program grep.py that takes a pattern and a filename as arguments and prints all the lines containing that pattern.

$ python grep.py def wc.py
def linecount(f):
def wordcount(f):
def charcount(f):
def main():
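A minimal grep.py could look like the sketch below. It assumes that "pattern" means a plain substring, not a regular expression; the function name grep is chosen for this example.

```python
import sys

def grep(pattern, filename):
    """Returns the lines of the file that contain pattern as a substring."""
    with open(filename) as f:
        return [line for line in f if pattern in line]

# run only when called as: python grep.py pattern filename
if __name__ == "__main__" and len(sys.argv) == 3:
    for line in grep(sys.argv[1], sys.argv[2]):
        print(line, end="")
```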

%%file sumfile.py
import sys
filename = sys.argv[1]
numbers = [int(line) for line in open(filename)]
print(sum(numbers))

Writing sumfile.py

%%file one-to-ten.txt
1
2
3
4
5
6
7
8
9
10

Writing one-to-ten.txt

!python sumfile.py one-to-ten.txt

55

Writing to Files

A file can be opened in write mode by specifying "w" as the second argument.

f = open("a.txt", "w")
f.write("one\n")
f.write("two\n")
f.close()

Let's see what we have in that file now.

open("a.txt").read()

'one\ntwo\n'

Q: How to test if a file already exists?

import os
os.path.exists("a.txt")

True

os.path.exists("b.txt")

False

To add more contents to an existing file, we need to open the file in append mode.

f = open("a.txt", "a")
f.write("three\n")
f.close()

open("a.txt").read()

'one\ntwo\nthree\n'

The `with` Statement

The with statement is handy when writing to files as it closes the file automatically at the end of the with block.

with open("b.txt", "w") as f:
    f.write("one\n")
    f.write("two\n")    
# f gets closed automatically here

open("b.txt").read()

'one\ntwo\n'
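The with statement is just as useful for reading, because the file is closed as soon as the block ends. A small sketch; the helper name read_lines is made up for this example.

```python
def read_lines(filename):
    """Returns the lines of the file without the trailing newlines."""
    with open(filename) as f:
        lines = [line.rstrip("\n") for line in f]
    # the with block has ended, so the file is already closed here
    assert f.closed
    return lines
```

Calling read_lines("b.txt") on the file written above should give the two lines 'one' and 'two'.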

Problem: Write a program copyfile.py to copy the contents of one file to another. The program should accept two filenames as command-line arguments and copy the first one into the second.

$ python copyfile.py a.txt a2.txt

WARNING: Don't call the file copy.py, as it would shadow the standard library module copy.

Problem: Write a program mergefiles.py that takes one target file and multiple source files as arguments and copies the contents of all source files into the target file.

$ python mergefiles.py ten.txt five.txt five-to-ten.txt

+Problem: Write a program split.py that splits a large file into multiple smaller files. The program should take a filename and the number of lines as arguments and write multiple small files, each containing the specified number of lines (the last one may have fewer lines).

$ python split.py 100.txt 30
writing 100.txt-part1
writing 100.txt-part2    
writing 100.txt-part3    
writing 100.txt-part4
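One way to approach split.py is sketched below. The function name split_file is an assumption, and the sketch reads the whole file into memory with readlines(), which is simple but may not suit very large files.

```python
import sys

def split_file(filename, n):
    """Splits the file into parts of n lines each.

    Returns the names of the part files that were written.
    """
    with open(filename) as f:
        lines = f.readlines()

    parts = []
    # step through the lines n at a time; the last chunk may be shorter
    for start in range(0, len(lines), n):
        partname = "{}-part{}".format(filename, start // n + 1)
        with open(partname, "w") as part:
            part.writelines(lines[start:start + n])
        parts.append(partname)
    return parts

# run only when called as: python split.py filename number-of-lines
if __name__ == "__main__" and len(sys.argv) == 3:
    for name in split_file(sys.argv[1], int(sys.argv[2])):
        print("writing", name)
```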

%%file copyfile.py
import sys
src = sys.argv[1]
dest = sys.argv[2]

contents = open(src).read()

with open(dest, "w") as f:
    f.write(contents)

Writing copyfile.py

%%file mergefiles.py
import sys

destfile = sys.argv[1]
srcfiles = sys.argv[2:]

print(destfile)
print(srcfiles)

with open(destfile, "w") as dest:
    for f in srcfiles:
        dest.write(open(f).read())

Overwriting mergefiles.py

!python mergefiles.py c.txt a.txt b.txt

c.txt
['a.txt', 'b.txt']

print(open("c.txt").read())

one
two
three
one
two

Binary and Text

type("helloworld")

str

The encode method encodes the given string as bytes using specified encoding.

x = "helloworld".encode("ascii")
x

b'helloworld'

"h"

'h'

"h".encode("utf-8")

b'h'

type(x)

bytes

Literal bytes are written like strings, but with a b prefix.

b'these are bytes'

b'these are bytes'

Let's look at some Unicode text.

name = "\u0c85\u0c86\u0c87\u0c88"

name

'ಅಆಇಈ'

len(name)

4

# this will fail
name.encode("ascii")

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-63-65ea99204210> in <module>()
      1 # this will fail
----> 2 name.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
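By default encode raises an error for characters the codec cannot represent. The optional errors argument of str.encode changes that behaviour. The sketch below shows three of the standard error handlers; which one is appropriate, if any, depends on the application.

```python
name = "\u0c85\u0c86\u0c87\u0c88"

# replace each unencodable character with "?"
replaced = name.encode("ascii", errors="replace")

# drop unencodable characters altogether
ignored = name.encode("ascii", errors="ignore")

# keep the information as backslash escapes
escaped = name.encode("ascii", errors="backslashreplace")

print(replaced)
print(ignored)
print(escaped)
```

All three lose or change the original text in some way, which is why an encoding such as UTF-8 is usually the better choice.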

name.encode("utf-8")

b'\xe0\xb2\x85\xe0\xb2\x86\xe0\xb2\x87\xe0\xb2\x88'

name_bytes = name.encode("utf-8")

len(name_bytes)

12

Let's try to write the name into a file.

f = open("kannada.txt", "w", encoding="utf-8")

f.write(name)
f.close()

# look at the file size using unix command ls
!ls -l kannada.txt

-rw-r--r--  1 anand  staff  12 Mar 16 11:27 kannada.txt

open("kannada.txt", "r", encoding="utf-8").read()

'ಅಆಇಈ'

open("kannada.txt", "rb").read()

b'\xe0\xb2\x85\xe0\xb2\x86\xe0\xb2\x87\xe0\xb2\x88'

open("kannada.txt", "r", encoding="utf-8")

<_io.TextIOWrapper name='kannada.txt' mode='r' encoding='utf-8'>

open("kannada.txt", "rb")

<_io.BufferedReader name='kannada.txt'>

!cat kannada.txt

������������

name_bytes

b'\xe0\xb2\x85\xe0\xb2\x86\xe0\xb2\x87\xe0\xb2\x88'

name_bytes.decode("utf-8")

'ಅಆಇಈ'

Q: How do strings and bytes work in Python 2?

%%file bytes.py
# -*- encoding: utf-8 -*-

name = 'ಅಆಇಈ'
print("hello" + name)

Overwriting bytes.py

!python bytes.py

Traceback (most recent call last):
  File "bytes.py", line 4, in <module>
    print("hello" + name)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-8: ordinal not in range(128)

The traceback suggests that printing failed because the output encoding in this environment is ASCII. Let's ignore this for now.

Example: Reading CSV files

%%file a.csv
A1,B1,C1
A2,B2,C2
A3,B3,C3

Writing a.csv

open("a.csv").readlines()

['A1,B1,C1\n', 'A2,B2,C2\n', 'A3,B3,C3']

[line for line in open("a.csv").readlines()]

['A1,B1,C1\n', 'A2,B2,C2\n', 'A3,B3,C3']

[line for line in open("a.csv")]

['A1,B1,C1\n', 'A2,B2,C2\n', 'A3,B3,C3']

[line.strip("\n") for line in open("a.csv")]

['A1,B1,C1', 'A2,B2,C2', 'A3,B3,C3']

[line.strip("\n").split(",") for line in open("a.csv")]

[['A1', 'B1', 'C1'], ['A2', 'B2', 'C2'], ['A3', 'B3', 'C3']]

def read_csv(filename):
    return [line.strip("\n").split(",") for line in open(filename)]

read_csv("a.csv")

[['A1', 'B1', 'C1'], ['A2', 'B2', 'C2'], ['A3', 'B3', 'C3']]

Problem: Improve the read_csv function written above to ignore empty lines and comments. Assume that comment lines start with a # character.

%%file b.csv
# begin
A1,B1,C1

A2,B2,C2
# last line
A3,B3,C3
#end

Overwriting b.csv

Problem: Improve the read_csv function further to take the delimiter as an optional argument. The delimiter should default to , when not specified.

>>> read_csv("c.txt", delimiter=":")
[['A1', 'B1', 'C1'], ['A2', 'B2', 'C2'], ['A3', 'B3', 'C3']]

%%file c.txt
A1:B1:C1
A2:B2:C2
A3:B3:C3

Overwriting c.txt
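A possible read_csv covering both problems is sketched below. The comment parameter is an extra added for this example; the problems only ask for the delimiter argument.

```python
def read_csv(filename, delimiter=",", comment="#"):
    """Reads a delimited file into a list of rows.

    Empty lines and lines starting with the comment character are skipped.
    """
    rows = []
    with open(filename) as f:
        for line in f:
            line = line.strip("\n")
            # skip empty lines and comment lines
            if line.strip() == "" or line.startswith(comment):
                continue
            rows.append(line.split(delimiter))
    return rows
```

With this version, read_csv("b.csv") and read_csv("c.txt", delimiter=":") should both give the same three rows as read_csv("a.csv").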

Writing Custom Modules

%%file mymodule.py
print("BEGIN mymodule")
x = 1

def add(a, b):
    return a+b

print(add(3, 4))
print("END mymodule")

Writing mymodule.py

!python mymodule.py

BEGIN mymodule
7
END mymodule

Let's say we want to use the add function defined in mymodule.py somewhere else.

%%file a.py
import mymodule
print(mymodule.x)
print(mymodule.add(2, 3))

Writing a.py

!python a.py

BEGIN mymodule
7
END mymodule
1
5

The `__name__` magic variable

We don't want these print statements to run when the file is imported as a module, but we still need them when the file is run as a script.

%%file mymodule2.py
x = 1

def add(a, b):
    return a+b

print(add(3, 4))
print(__name__)

Overwriting mymodule2.py

!python mymodule2.py

7
__main__

When the file is executed as a script, the special variable __name__ is set to "__main__".

!python -c "import mymodule2"

7
mymodule2

But when the file is imported as a module, __name__ is set to the module name.

%%file mymodule3.py
x = 1

def add(a, b):
    return a+b

if __name__ == "__main__":
    # Run the following code only when this file is 
    # executed as a script.
    # Ignore this when imported as a module.
    print(add(3, 4))

Writing mymodule3.py

!python mymodule3.py

7

!python -c "import mymodule3"

Problem: Make the wc.py that we wrote earlier importable.

>>> import wc
>>> wc.linecount("a.txt")
3

%%file wc2.py
import sys

def linecount(f):
    return len(open(f).readlines())

def wordcount(f):
    return len(open(f).read().split())

def charcount(f):
    return len(open(f).read())

def main():
    f = sys.argv[1]
    print(linecount(f), wordcount(f), charcount(f), f)

if __name__ == "__main__":    
    main()

Overwriting wc2.py

import wc2

wc2.linecount("a.txt")

3

help("wc2")

Help on module wc2:

NAME
    wc2

FUNCTIONS
    charcount(f)
    
    linecount(f)
    
    main()
    
    wordcount(f)

DATA
    __warningregistry__ = {'version': 341, ("unclosed file <_io.TextIOWrap...

FILE
    /Users/anand/trainings/2016/cognizant/wc2.py

Docstrings

def square(x):
    return x*x

help(square)

Help on function square in module __main__:

square(x)

def square(x):
    "Computes square of a number."
    return x*x

help(square)

Help on function square in module __main__:

square(x)
    Computes square of a number.

def square(x):
    """Computes square of a number.
    
        >>> square(3)
        9
    """
    return x*x

help(square)

Help on function square in module __main__:

square(x)
    Computes square of a number.
    
    >>> square(3)
    9

square?
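The >>> examples in a docstring are not only documentation; the standard doctest module can run them and compare the results. A minimal sketch follows; the helper name check_examples is made up for this example.

```python
import doctest

def square(x):
    """Computes square of a number.

        >>> square(3)
        9
    """
    return x * x

def check_examples(func):
    """Runs the docstring examples of func.

    Returns a (failed, attempted) pair of counts.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = attempted = 0
    # the examples need to be able to look up the function by its name
    globs = {func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=globs):
        result = runner.run(test)
        failed += result.failed
        attempted += result.attempted
    return failed, attempted

print(check_examples(square))
```

For a whole file, running python -m doctest followed by the filename checks every docstring in it.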

%%file mymodule4.py
"""This is mymodule4. 

Written to demonstrate docstrings.
"""
x = 1

def add(a, b):
    """Adds two numbers.
    
        >>> add(3, 4)
        7
    """
    return a+b

if __name__ == "__main__":
    # Run the following code only when this file is 
    # executed as a script.
    # Ignore this when imported as a module.
    print(add(3, 4))

Writing mymodule4.py

help("mymodule4")

Help on module mymodule4:

NAME
    mymodule4 - This is mymodule4.

DESCRIPTION
    Written to demonstrate docstrings.

FUNCTIONS
    add(a, b)
        Adds two numbers.
        
        >>> add(3, 4)
        7

DATA
    x = 1

FILE
    /Users/anand/trainings/2016/cognizant/mymodule4.py

Q: What is from module import something?

import wc2
print(wc2.linecount("a.txt"))

3

wc2

<module 'wc2' from '/Users/anand/trainings/2016/cognizant/wc2.py'>

from wc2 import linecount

linecount("a.txt")

3

from wc2 import linecount as lc

lc("a.txt")

3

import wc2 as wc
wc.linecount("a.txt")

3

import time as t
t.asctime()

'Wed Mar 16 12:44:05 2016'

Q: What is the difference between a package and a module?

A package is basically a module that contains other modules inside it.

Let's try to create a utils package.

!mkdir utils

%%file utils/square.py
"""The square module."""

def square(x):
    return x*x

Writing utils/square.py

%%file utils/cube.py
"""The cube module."""

def cube(x):
    return x*x*x

Overwriting utils/cube.py

%%file utils/__init__.py
"""The utils package.

This provides square and cube modules.
"""

Writing utils/__init__.py

!tree utils/

utils/
|-- __init__.py
|-- cube.py
`-- square.py

0 directories, 3 files

from utils.square import square
square(4)

16

from utils.cube import cube
cube(4)

64

help("utils")

Help on package utils:

NAME
    utils - The utils package.

DESCRIPTION
    This provides square and cube modules.

PACKAGE CONTENTS
    cube
    square

FILE
    /Users/anand/trainings/2016/cognizant/utils/__init__.py

help("utils.square")

Help on module utils.square in utils:

NAME
    utils.square

FUNCTIONS
    square(x)

FILE
    /Users/anand/trainings/2016/cognizant/utils/square.py

Dictionaries

d = {"x": 1, "y": 2}

d['x']

1

d['y']

2

d['x'] = 11

d

{'x': 11, 'y': 2}

d['z'] = 3

print(d)

{'x': 11, 'y': 2, 'z': 3}

person = {
    "name": "Alice",
    "email": "alice@example.com",
    "phone": "9876500012"
}

A dictionary can also be created by passing a list of key-value pairs to the dict function.

dict([("x", 1), ("y", 2), ("z", 3)])

{'x': 1, 'y': 2, 'z': 3}

Let's try a simple example.

%%file prices.txt
apple 20
mango 40
banana 30

Writing prices.txt

def load_prices(filename):
    prices = {}
    for line in open(filename):
        name, price = line.strip().split()
        prices[name] = int(price)
    return prices

prices = load_prices("prices.txt")

prices["apple"]

20

%%file inventory.txt
notebook 100
pen 58
pencil 83

Writing inventory.txt

inventory = load_prices("inventory.txt")

inventory['notebook']

100

inventory.get("notebook", 0)

100

inventory.get("ruler", 0)

0

Q: How to check if a key is present in a dictionary?

"notebook" in inventory

True

"ruler" in inventory

False

inventory.keys()

dict_keys(['pencil', 'notebook', 'pen'])

inventory.values()

dict_values([83, 100, 58])

inventory.items()

dict_items([('pencil', 83), ('notebook', 100), ('pen', 58)])

for k in inventory.keys():
    print(k)

pencil
notebook
pen

for k in inventory:
    print(k)

pencil
notebook
pen

for v in inventory.values():
    print(v)

83
100
58

for k,v in inventory.items():
    print(k, v)

pencil 83
notebook 100
pen 58

Problem: Given a file containing the prices of each product and another file containing the products purchased and their quantity, write a program to generate a bill for the purchases.

$ python bill.py prices.txt purchases.txt
mango 5 40 200
apple 2 20 40
banana 4 30 120
TOTAL 360

%%file prices.txt
apple 20
mango 40
banana 30

Overwriting prices.txt

%%file purchases.txt
mango 5
apple 2
banana 4

Writing purchases.txt

def load_dict(filename):
    prices = {}
    for line in open(filename):
        name, price = line.strip().split()
        prices[name] = int(price)
    return prices
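With the two files loaded, the bill itself is a loop over the purchases. The sketch below assumes the data is already in memory; the function name make_bill is made up, and the purchases are kept as a list of pairs so that the bill follows the order of the purchases file.

```python
def make_bill(prices, purchases):
    """Returns the bill lines and the total amount.

    prices maps a product name to its unit price; purchases is a list
    of (name, quantity) pairs in the order they appear on the bill.
    """
    lines = []
    total = 0
    for name, quantity in purchases:
        price = prices[name]
        amount = price * quantity
        lines.append("{} {} {} {}".format(name, quantity, price, amount))
        total += amount
    lines.append("TOTAL {}".format(total))
    return lines, total

prices = {"apple": 20, "mango": 40, "banana": 30}
purchases = [("mango", 5), ("apple", 2), ("banana", 4)]
lines, total = make_bill(prices, purchases)
print("\n".join(lines))
```

A product that is missing from the prices file raises a KeyError here; prices.get(name, 0) would be an alternative if missing products should be billed as zero.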

Example: Word Count

%%file words.txt
five
five four
five four three
five four three two
five four three two one

Writing words.txt

%%file wordfreq.py
"""Program to compute frequency of words in a file.

USAGE: python wordfreq.py filename.txt
"""
import sys

def read_words(filename):
    """Reads words from a file."""
    return open(filename).read().split()

def wordfreq(words):
    """Computes the frequency of each word in the given list of words.
    """
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq

def print_freq(freq):
    """Prints the frequency of words.
    """
    # TODO: FIXME
    print(freq)

def main():
    filename = sys.argv[1]
    words = read_words(filename)
    freq = wordfreq(words)
    print_freq(freq)
    
if __name__ == "__main__":
    main()

Overwriting wordfreq.py

!python wordfreq.py words.txt

{'two': 2, 'one': 1, 'four': 4, 'five': 5, 'three': 3}

Problem: Improve the above program to print one word per line, like the following:

two 2
one 1
four 4
five 5
three 3

Problem: Improve the above program further to print the words sorted by count, with the most common word on the top.

five 5
four 4
three 3
two 2
one 1
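Both improvements come down to iterating over freq.items(). A possible print_freq, sketched with a small helper by_count (a name made up for this example) that does the sorting:

```python
def by_count(freq):
    """Returns the (word, count) pairs sorted by count, largest first."""
    return sorted(freq.items(), key=lambda item: item[1], reverse=True)

def print_freq(freq):
    """Prints one word per line, with the most common word on top."""
    for word, count in by_count(freq):
        print(word, count)

print_freq({'two': 2, 'one': 1, 'four': 4, 'five': 5, 'three': 3})
```

The key function picks the count out of each pair, and reverse=True puts the largest count first.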

+Problem: Write a program extcount.py to count the number of files per extension in the given directory. The program should take the path to a directory as a command-line argument and print the count and extension for each available extension.

$ python extcount.py foo/
14 py
2 txt
1 csv

Can you reuse the wordfreq function implemented in the above example, by importing it as a module?
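One possible extcount.py is sketched below. To keep the sketch self-contained, wordfreq is copied in; inside the course directory the copy could be replaced by from wordfreq import wordfreq. The helper name extensions is made up, and files without an extension are skipped, which is an assumption the problem does not spell out.

```python
import os
import sys

def wordfreq(words):
    """Computes the frequency of each word, as in wordfreq.py."""
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return freq

def extensions(dirname):
    """Returns the extension of every file in dirname that has one."""
    exts = []
    for name in os.listdir(dirname):
        ext = os.path.splitext(name)[1]
        # skip sub-directories and files without an extension
        if ext and os.path.isfile(os.path.join(dirname, name)):
            exts.append(ext[1:])  # drop the leading "."
    return exts

# run only when called as: python extcount.py directory
if __name__ == "__main__" and len(sys.argv) == 2:
    freq = wordfreq(extensions(sys.argv[1]))
    # most common extension first
    for ext, count in sorted(freq.items(), key=lambda item: item[1], reverse=True):
        print(count, ext)
```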

Classes

class Point:
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

p = Point()
print(p.x, p.y)

0 0

q = Point(2, 3)
print(q.x, q.y)

2 3

class Point:
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y
        
    def getx(self):
        return self.x
    
    def display(self):
        print(self.x, self.y)
        
    def add(self, p):
        x = self.x + p.x
        y = self.y + p.y
        return Point(x, y)
    
p1 = Point(1, 2)
p2 = Point(3, 4)

p1.display()
p2.display()
print(p1.getx())

p3 = p1.add(p2)
p3.display()

1 2
3 4
1
4 6

Problem: Write a method double that returns a new Point with both x and y coordinates doubled.

>>> p = Point(2, 3)
>>> q = p.double()
>>> q.display()
4 6
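A possible double method, shown on a trimmed-down copy of the Point class so that the sketch runs on its own:

```python
class Point:
    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

    def display(self):
        print(self.x, self.y)

    def double(self):
        # build a new Point instead of modifying self
        return Point(2 * self.x, 2 * self.y)

p = Point(2, 3)
q = p.double()
q.display()
```

Like add, the method returns a new Point, so p itself stays at 2 3.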

p = Point(1, 2)

p.x

1

p.y

2

p.z = 3

p.z

3

p.__dict__

{'x': 1, 'y': 2, 'z': 3}

p.__class__

__main__.Point

Q: Can a method of a class access the dynamically added attributes?

class Foo:
    def getx(self):
        return self.x

f = Foo()
f.getx()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-c25cfc827fb5> in <module>()
      1 f = Foo()
----> 2 f.getx()

<ipython-input-14-27f09e900618> in getx(self)
      1 class Foo:
      2     def getx(self):
----> 3         return self.x

AttributeError: 'Foo' object has no attribute 'x'

f.x = 1
f.getx()

1

class DummyFile:
    def read(self):
        return "one two three four"

def read_words(fileobj):
    return fileobj.read().split()

read_words(open("words.txt"))

['five',
 'five',
 'four',
 'five',
 'four',
 'three',
 'five',
 'four',
 'three',
 'two',
 'five',
 'four',
 'three',
 'two',
 'one']

read_words(DummyFile())

['one', 'two', 'three', 'four']

Example: CSVParser

class CSVParser:
    def __init__(self, delimiter=",", comment_indicator="#"):
        self.delimiter = delimiter
        self.comment_indicator = comment_indicator
        
    def parse(self, filename):
        return [line.strip("\n").split(self.delimiter) 
                for line in open(filename)
                if not line.startswith(self.comment_indicator)
                   and line.strip() != ""]

csv_parser = CSVParser(delimiter=",")
tsv_parser = CSVParser(delimiter="\t")
special_parser = CSVParser(delimiter=":", comment_indicator=",")

csv_parser.parse("a.csv")

[['A1', 'B1', 'C1'], ['A2', 'B2', 'C2'], ['A3', 'B3', 'C3']]

Exception Handling

no_such_variable

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-32-b7c1357f8e68> in <module>()
----> 1 no_such_variable

NameError: name 'no_such_variable' is not defined

open("nofile.txt")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-33-dba134cf36ca> in <module>()
----> 1 open("nofile.txt")

FileNotFoundError: [Errno 2] No such file or directory: 'nofile.txt'

int("not-a-number")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-ac8f40c8d19c> in <module>()
----> 1 int("not-a-number")

ValueError: invalid literal for int() with base 10: 'not-a-number'

Let's try an example.

def read_file(filename):
    """Returns contents of the file.
    
    If the file is not found, returns an empty string.
    """
    try:
        return open(filename).read()
    except FileNotFoundError:
        return ""

read_file("a.txt")

'one\ntwo\nthree\n'

read_file("nofile.txt")

''

Problem: Write a function safeint to convert a given string into an integer. The function should accept two arguments, the string to convert and a default value. If the given string is not a valid integer, the default value should be returned.

>>> safeint("3", 0)
3
>>> safeint("NA", 0)
0

Problem: Improve the sumfile.py we wrote earlier to ignore invalid numbers after printing a warning message.

$ python sumfile.py num.txt
WARNING: Bad number 'N/A'
WARNING: Bad number 'xxx'
15

%%file num.txt
1
2
3
N/A
4
xxx
5

Writing num.txt

%%file sumfile.py
import sys
filename = sys.argv[1]

def safeint(value, default):
    try:
        return int(value)
    except ValueError:
        print("WARNING: Bad number", repr(value))
        return default

numbers = [safeint(line.strip(), 0) for line in open(filename)]
print(sum(numbers))

Overwriting sumfile.py

!python sumfile.py num.txt

WARNING: Bad number 'N/A'
WARNING: Bad number 'xxx'
15

Why Python 3?

There are a lot of nice features coming up in Python 3.

def add(x: int, y: int) -> int:
    return x+y+"hello"
