Messing With Government Data Using Python


PyCon India 2014
September 27-28, 2014





Anand Chitipothu
@anandology

During the General Elections 2014...

I volunteered to provide technical assistance to an election campaign in Bangalore (and also in Andhra Pradesh).

And I ended up building...

  • A campaign management system
  • A volunteer signup system
  • A webapp to find voter details by voter ID
  • A script to format the voter list of a polling center as a compact PDF
  • Other small tools

GitHub Activity

Glossary

  • Parliamentary Constituency (PC25 - Bangalore North)
  • Assembly Constituency (AC158 - Hebbal)
  • Ward (W046 - Jayachamarajendra Nagar)
  • Polling Center
    • Typically a school/govt building containing one or more polling booths
    • E.g. PX065 - Adarsha Vidya Mandira, R T Nagara
  • Polling Booth
    • E.g. PB0203 - Adarsha Vidya Mandira, Room No-1
    • Typically has about 1000 voters
  • VoterID
    • a (supposedly) unique identifier for a voter
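
The codes in this glossary follow a prefix plus zero-padded number pattern. A minimal sketch of generating them is below. The PC and AC widths match the format strings that appear in the code on later slides; the widths for W, PX and PB are inferred from the single examples above, and the names CODE_FORMATS and make_code are hypothetical.

```python
# Prefix and zero-padded width for each level of the hierarchy.
# Assumption: W, PX and PB widths are guessed from one example each.
CODE_FORMATS = {
    "pc": "PC{0:02d}",              # Parliamentary Constituency, e.g. PC25
    "ac": "AC{0:03d}",              # Assembly Constituency, e.g. AC158
    "ward": "W{0:03d}",             # Ward, e.g. W046
    "polling_center": "PX{0:03d}",  # Polling Center, e.g. PX065
    "polling_booth": "PB{0:04d}",   # Polling Booth, e.g. PB0203
}


def make_code(level, number):
    """Build a code such as 'AC158' from a level name and a number."""
    return CODE_FORMATS[level].format(int(number))
```

With these formats, make_code("ward", 46) gives "W046" and make_code("polling_booth", "203") gives "PB0203".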

The Challenges

The Campaign Management System

Volunteer Sign Up System

Find Your Polling Booth

Compact Voter List

Important Polling Centers

The Fun Part

Parsing HTML pages

  • Beautiful Soup is your friend
  • Always save intermediate results
  • ASP.net is the worst thing that ever happened to the web

BeautifulSoup

from bs4 import BeautifulSoup
import urllib2

def parse(html):
    soup = BeautifulSoup(html)

    # find all rows of the table
    rows = soup.select("#ctl00_ContentPlaceHolder1_GridView1 tr")

    # extract the cell text of every row except the header row
    for tr in rows[1:]:
        yield [td.get_text() for td in tr.find_all("td")]

URL = ("http://ceokarnataka.kar.nic.in/ElectionFinalroll2014/" + 
       "Part_List.aspx?ACNO=158")

html = urllib2.urlopen(URL).read()
data = parse(html)

Save Intermediate Results

@cache.disk_memoize("cache/wp.html")
def get_wp_page():
    return urllib2.urlopen(WP_URL).read()

@cache.disk_memoize("cache/table_{0}.json")
def get_table_for_state(state):
    ...

@cache.disk_memoize("cache/{state_name}_pc.tsv")
def get_pc_list(state_name):
    return [['PC{0:02d}'.format(int(row[0])), row[1].strip()] 
            for row in get_table_for_state(state_name)]

Disk Memoize

@cache.disk_memoize("cache/MP/districts.json")
def get_districts(self):
    ...

@cache.disk_memoize("cache/MP/AC{ac:03d}_booths.tsv")
def get_booths_of_ac(self, dist, ac):
    ...

@cache.disk_memoize("cache/map/{1[state]}/district_{1[district]}_acs.json")
def get_district_acs(self, district):
    ...
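
The cache module behind these decorators is not shown on the slides. The following is a minimal sketch of how such a decorator could work, assuming Python 3 and the standard library only. It is not the implementation used in the campaign: the helper names _dump and _load and the choice of storage format by file extension are assumptions. The path template is filled in with str.format, so both positional fields ({0}, {1[state]}) and parameter names ({state_name}, {ac:03d}) resolve against the call's arguments.

```python
import functools
import inspect
import json
import os


def _dump(path, value):
    """Write value to path in a format chosen by the file extension."""
    with open(path, "w", encoding="utf-8") as f:
        if path.endswith(".json"):
            json.dump(value, f)
        elif path.endswith(".tsv"):
            # value is expected to be a list of rows (lists of cells)
            f.write("\n".join("\t".join(str(cell) for cell in row)
                              for row in value))
        else:
            f.write(value)


def _load(path):
    """Read back what _dump wrote; TSV cells always come back as strings."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".json"):
            return json.load(f)
        if path.endswith(".tsv"):
            return [line.split("\t") for line in f.read().splitlines()]
        return f.read()


def disk_memoize(path_template):
    """Cache a function's result in a file named after its arguments."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Bind the call so the template can refer to arguments
            # by position as well as by parameter name.
            bound = signature.bind(*args, **kwargs)
            path = path_template.format(*bound.args, **bound.arguments)
            if os.path.exists(path):
                return _load(path)
            value = func(*args, **kwargs)
            directory = os.path.dirname(path)
            if directory:
                os.makedirs(directory, exist_ok=True)
            _dump(path, value)
            return value
        return wrapper
    return decorator
```

Because the cached files are plain JSON, TSV or text, they can be inspected or deleted by hand to force a single step to run again.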

The Hell of ASP.net

Escaping the Hell of ASP.net

@cache.disk_memoize("cache/MP/districts.json")
def get_districts(self):
    return self.browser.get_select_options("ddlDistrict")

@cache.disk_memoize("cache/MP/AC{ac:03d}_booths.tsv")
def get_booths_of_ac(self, dist, ac):
    self.browser.select_option('ddlDistrict', dist)
    self.browser.select_option('ddlAssembly', ac)
    soup = self.browser.get_soup()
    ...
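
The browser helper used above is not shown on the slides. What makes ASP.net WebForms pages awkward to scrape is that choosing a dropdown option triggers a "postback": the whole form is POSTed back, including hidden fields such as __VIEWSTATE and __EVENTVALIDATION, with __EVENTTARGET naming the control that changed. A minimal sketch of building that form data follows, assuming Python 3 and the standard library only. HiddenFieldParser and build_postback are hypothetical names, and target is assumed to be the control's name attribute.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback(html, target, value):
    """Return the form data that selecting `value` in `target` would send."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)      # __VIEWSTATE, __EVENTVALIDATION, ...
    data["__EVENTTARGET"] = target  # the control that caused the postback
    data["__EVENTARGUMENT"] = ""
    data[target] = value            # the newly selected option
    return data
```

The returned dict can be URL-encoded and POSTed to the same URL. The response carries fresh hidden fields, which have to be read again before the next dropdown is changed.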

Parsing PDFs

Extracting Ward Info

  • pdftotext -layout a.pdf a.txt
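
The conversion can also be driven from Python, keeping the text file as an intermediate result so each PDF is converted only once. This is a minimal sketch, assuming Python 3 and that the pdftotext command-line tool is installed; pdf_to_text is a hypothetical helper name.

```python
import os
import subprocess


def pdf_to_text(pdf_path, txt_path):
    """Return the layout-preserving text of a PDF, converting it at most once."""
    if not os.path.exists(txt_path):
        # -layout keeps the physical layout, so table columns stay aligned
        subprocess.check_call(["pdftotext", "-layout", pdf_path, txt_path])
    with open(txt_path, encoding="utf-8") as f:
        return f.read()
```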

def parse_ward(self):
    section = self.read_section(self.text, 
        "2. DETAILS OF PART & POLLING AREA", 
        "3. POLLING STATION DETAILS")
    start_index = self.get_column_index(section, 
        "Ward No.", "Taluka", "Police Station", "District")
    text = self.select_window(section, start_index, 1000)
    ward_info = self.extract_text(text, "Ward No.", "Police Station")
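
The helper methods called above are not shown on the slide. The following is one plausible reading of them as plain functions, offered as a sketch only: the function bodies and the sample text are illustrative assumptions, not the code or data from the campaign. The idea is to cut the section out of the page, find the column where the right-hand labels start, keep only that vertical slice, and then read the value between two labels.

```python
def read_section(text, start_marker, end_marker):
    """Return the text between two section headings."""
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


def get_column_index(section, *labels):
    """Return the leftmost column at which any of the labels starts."""
    positions = [line.index(label)
                 for line in section.splitlines()
                 for label in labels
                 if label in line]
    return min(positions)


def select_window(section, start, width):
    """Cut the same vertical slice of columns out of every line."""
    return "\n".join(line[start:start + width]
                     for line in section.splitlines())


def extract_text(text, start_label, end_label):
    """Return the text between two labels, without padding or colons."""
    start = text.index(start_label) + len(start_label)
    end = text.index(end_label, start)
    return text[start:end].strip(" :\n")


def parse_ward(text):
    section = read_section(text,
        "2. DETAILS OF PART & POLLING AREA",
        "3. POLLING STATION DETAILS")
    start_index = get_column_index(section,
        "Ward No.", "Taluka", "Police Station", "District")
    window = select_window(section, start_index, 1000)
    return extract_text(window, "Ward No.", "Police Station")


# Made-up sample in the shape that `pdftotext -layout` might produce:
# street names on the left, labelled fields in a column on the right.
_ROWS = [
    ("1. Example Street", "Ward No.       : 46 Example Ward"),
    ("2. Another Road", "Police Station : Example PS"),
    ("3. Third Cross", "District       : Example District"),
]
SAMPLE = ("2. DETAILS OF PART & POLLING AREA\n"
          + "\n".join("{0:<27}{1}".format(left, right)
                      for left, right in _ROWS)
          + "\n3. POLLING STATION DETAILS\n")
```

For this sample, parse_ward(SAMPLE) returns "46 Example Ward". Real voter-list pages are less regular, so the labels and window width would need adjusting per state.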

Ward and PC Boundaries

Thanks to Open Bangalore and DataMeet group for map boundaries.

Formatting Voter Lists

  • ReportLab works fine.
  • Beware of performance issues.

Post Elections

I continued to mess with more government data and improve the system.

GitHub Activity

Summary

  • Messing with government data is challenging and fun
  • BeautifulSoup is good enough for parsing HTML (even ASP.net websites!)
  • Saving intermediate results saves you a lot of time
  • Parsing PDFs is a bit tough, but not impossible
  • All this data will be very valuable when made available openly
