Your Blog in House

Dealing with xls files in Python

It’s been a long day coding. I’ve nearly finished all the preliminary scrapers for each state parliament and I must say I’ll be glad to move onto the next phase.

I thought it’s probably worth outlining my biggest coding struggle today just in case anyone else ever has the same issues I did.

To date, all mailing lists that I’ve processed have been csv files, which are wonderful because they are incredibly simple and for the most part just don’t need to be any more complex than they are. However, it would seem the Parliament of Western Australia disagrees and to hell with anyone that operates outside a Microsoft environment. I’ll spare you the diatribe about the importance of governments being platform neutral…

So I had to process an xls file which proved to be quite a bit more difficult than I thought it might be although in the end the solution was quite simple:

import requests
import os
import xlrd

class WaData(BaseData):


    def getData(self, url):
        """ Returns dictionary of CSV Data """
        csvfile = requests.get(url, stream=True)
        # create temp file
        temp_file = 'temp.xls'
        with open(temp_file, 'wb') as f:
            for chunk in csvfile.iter_content():
        book = xlrd.open_workbook(temp_file)
        # get excel 'sheets'
        sheet_names = book.sheet_names()
        # get first sheet
        sheet = book.sheet_by_name(sheet_names[0])
        # get header row
        header = sheet.row_values(0)
        # create list of dictionary values skipping the first one
        values = []
        for rownum in range(1, sheet.nrows):
                dict(zip(header, sheet.row_values(rownum)))
        return values

There are some odd choices in here which is it worth explaining.

Firstly, I create a temp file ‘manually’ rather than using the tempfile library. That’s because xlrd doesn’t seem to like the temp files created with tempfile. tempfile creates a ‘file-like’ object which seems to exist as a buffer and xlrd doesn’t want to open them.

I also wrote the content to the temp file using the requests inter_content() method as it just seemed to work a bit nicer given it’s not plain text that I’m writing to the file.

Then, while I’m sure it’s an excel/Microsoft issue, rather than an xlrd one, finding your way through the book object is a bit of a pain. So I get the sheet names of the book, then get the first one returned (lines 20 – 22).

I then loop over the rows in the sheet using an index and create a dictionary by combining the header row with the values in each row (lines 26 – 30).

Finally I clean up the temp file. Easy really.

Categories: Coding, Data

Tags: , , , ,

Be the first to comment

Upgrading from Python 2 to Python 3

I’ve always liked to think of myself as an early adopter. However to date, I’ve been using Python 2 and it’s been bugging me. We’re up to Python 3.5 already.

I think the biggest barrier for me has always been that OS X’s default is still 2.7 but it’s not like that’s difficult to get around. I’m mean, I am a mad hacker after all!

Getting Python 3 on your OS X machine is super easy thanks to Homebrew:

§ brew install python3

Don’t worry it just installs Python 3 right next to Python 2.7.

To find out where it is installed it’s as simple as:

§ which python3

And once you know that you can easily setup a virtualenvwrapper that used Python 3 so you never have to think about it again…

§ mkvirtualenv --python=/usr/bin/python3 nameOfEnvironment

My next step came from a hot tip from the Neck Beard Republic’s fantastic screen cast on Converting Python2 to Python3. He suggests you use a tool called 2to3 which scans a specified file or directory for changes required to upgrade.

So what work was required to upgrade my app from Python 2 to 3?

There were a couple of print statements that I needed to turn into functions. Obvs. I’m still working on the required muscle memory for that one.

Then there was relative imports. The 2to3 scan recommended adding a dot (.) to a range of imports where I’m importing modules from my app. I’m still getting my head around this one and when I actually ran my code and tests I found I had to include the module name as well as the dot.

So instead of:

from .base import BaseTestCase

I found I had to:

from tests.base import BaseTestCase

Strangely I haven’t had to do that for the scrapers though. As I say, I’m not sure I’ve fully got my head around that one.

Then there was a change in the urllib package but that was a simple fix just by changing the imports from:

from urllib import quote


from urllib.parse import quote

There were also a few unicodes but that was more or less a bulk find and replace of ‘unicode’ for ‘str’ as str has replaced unicode in Python 3 where unicode is now (quite rightly) the default encoding.

The zip() function now returns a generator rather than a list. I’m not sure if it would have made any difference but I changed

query = zip(conditions[0::2], conditions[1::2])


query = list(zip(conditions[0::2], conditions[1::2]))

And the last thing I need to change was the way I handle the response of a csv file from the Requests module hands a file off to the csv.DictReader() method. I was getting an error saying:

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

As it turns out this is related to the encoding changes in Python 3. So what, in Python 2, looked like this:

def getCsvData(self, file):
        ''' Returns dictionary of CSV Data '''
        csvfile = requests.get(file, stream=True)
        return csv.DictReader(csvfile.raw)

Now looks like this:

def getCsvData(self, file):
        ''' Returns dictionary of CSV Data '''
        csvfile = requests.get(file, stream=True)
        return csv.DictReader(StringIO(csvfile.text))

(with an import of StringIO from the io module).

The final thing left to do was to reinstall the relevant Python modules using pip. This primarily meant that wsgiref was excluded this time which I’m pretty sure was/is part of the Flask dependencies and presumably not required with Python 3.

Realistically, this took be about 45 mins and has, so far, been pretty hassle free. A couple of Google Searches to figure out the StringIO thing was the most onerous and now I’m back to being an early adopter. There’s pretty handy guid here and you can see most of the changes I had to make in this commit.


I forgot to mention one other error that I got that it is probably worth mentioning.

When running my tests I got an error saying:

AssertionError: Popped wrong request context.

It turns out this is related to Flask-Test, context preservation and Python 3 (I couldn’t find a clear reason why in anything I read). To fix it I just added the following to my file:


Categories: Python

Tags: , , , , , , , , ,

Be the first to comment

Database Listeners

I’m starting to work on a system of checks to ensure the database is up to date. The way I see it, the checks will run on a fairly regular cron and generate ‘issues’ where Data doesn’t line up.

Users will also have a way of generating ‘issues’ so there’s a human element to checking the data integrity as well.

As it stands though, that would mean that I have to check a database table regularly. Surely it would be better to get an email or something every time there was a new entry in the database.

And as it turns out there’s a pretty easy way to do that using SQLAlchemy and Flask.

Here’s a truncated version of my file:

from yvih import db
from sqlalchemy import event
from sqlalchemy.orm import mapper
import smtplib

class Data(db.Model):
    __tablename__ = 'data'

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)
    email = db.Column(db.String)
    issue = db.Column(db.Text)
    member_id = db.Column(db.Integer, db.ForeignKey(''))
    status = db.Column(db.Integer)

    def __init__(self, name, email, issue, member_id, status = 0): = name = email
        self.issue = issue
        self.member_id = member_id
        self.status = status

def data_listener(mapper, connection, target):
    ''' send email to alert that new issue has been raised '''
    message = """From: From Person
    To: To Person
    Subject: A new issue has been generated

    message = message % (target.issue)
    # @todo: send email from here

event.listen(Data, 'after_insert', data_listener)

The first thing to note is the includes. ‘event‘ need to be include from ‘sqlalchemy’ and ‘mapper‘ from ‘sqlalchemy.orm’.

Essentially events listen to mapped classes and execute a function when an event occurs on that class.

So in this case, I’ve set up a listener on the ‘Data’ class and am listening for the ‘after_insert’ event which then executes the ‘data_listener’ function.

The event listener will pass three arguments to the listening function so make sure there are three variables passed to the function. The third argument is the data that is being saved so in the case, I can use it to email the issue.

Categories: Coding, Data, Flask, Python

Tags: , , , ,

Be the first to comment


I’ve created a ‘Rersources’ repository on Github that I’ll add to with time.

It may be of assistance/interest to those of you that are a little more code-inclined.

You can find it at:

Categories: Coding

Tags: ,

Be the first to comment

Dealing with dodgy markup

Lately I’ve been working on scraping various parliamentary websites to collect all the data I need for the YVIH API.

I might write up how I’ve gone about this process in more detail in another post as there’s plenty to discuss. However today’s post is about dealing with dodgy markup.

Of course it would be really great if the websites of Australian government adhered to W3C standards. I can assure you that at the very least, both Queensland and the ACT’s ‘current members’ pages do not. In fact they both fail quite badly and are riddled with stray tags and other markup errors.

This is maddening. Government bodies need to be getting this sort of thing right. It’s an accessibility issue for both computers and people using screen reading devices. If it isn’t already, it should be a contractual requirement for anyone building a website for a government body that the work is standards compliant.

From my perspective this becomes an issue when trying to scape data. Beautiful Soup does a remarkable job of interpreting HTML and has relatively high degree of tolerance for malformed markup. However sometimes it doesn’t quite do what you want it to do and in my experience Beautiful Soup tends to assume that a tag has closed when it hits something it shouldn’t. This has the effect of not being able to find anything in the document or section from that point on.

After a bit of digging around though I found that Beautiful Soup supports a range of third party parsers. One of them is html5lib which parses pages the same way a web browser does and creates valid HTML5. Perfect.

So I installed html5lib using pip:

pip install html5lib

Then instead of processing the page using Beautiful Soup with the standard library:

page = requests.get(link).content
soup = BeautifulSoup(page)

I tell it to use html5lib:

page = requests.get(link).content
soup = BeautifulSoup(page, "html5lib")

It worked for me on the Queensland members page and I’m hopeful it will work for the ACT as well. If not, there’s a couple of other parsers that I can try.

Categories: Data

Tags: , ,

Be the first to comment

Building queries with Flask-SQLAlchemy

I guess it’s probably about time I wrote about a bit of code on this little blog of mine.

SQLAlchemy is a ‘Python SQL Toolkit and Object Relational Mapper’. As we know Flask is pretty barebones but there’s plenty of extensions including one for Flask-SQLAlchemy. And it’s pretty handy. When I first started playing around with this app I tried writing my own basic Model class that used the built in database functions for Flask. But it wasn’t long before all roads were pointing to SQLAlchemy – especially if I ever wanted to upgrade from SQLite at some stage with a minimum of fuss.

It’s powerful and allows you to create models easily and do just about everything you want straight out of the box.

The hard part though, it the rather limited documentation on the Flask-SQLAlchemy site. You’re therefore made to cross reference with the standard SQLAlchemy site where the implementation can be just different enough to be quite confusing.

Given that, I found it quite difficult to find figure out how to build a site with dynamically generated queries based on URL strings – you know, the sort that are central to APIs.

So I thought I’d just post my code here and step through it to explain what I’ve done:

from flask import Blueprint, abort
from yvih import db
from yvih.models import Member
from sqlalchemy.sql.expression import or_


def members( conditions=None ):
    # if there are parameters
    if conditions != None:

        # get query and turn it into a dictionary
        conditions = conditions.split('/')
        query = zip(conditions[0::2], conditions[1::2])

        # check that all fields are valid
        if not parameter_accepted(query):

        # build query
        filters = []
        for conditions in query:
            if conditions[0] == 'id':
                for term in conditions[1].split(','):
                    filters.append(Member.__dict__[conditions[0]].ilike('%' + term + '%'))
        members = Member.query.filter( or_(*filters) )

I’m obviously using Blueprints here but more about that later.

The important bit is the query building. Firstly I take the ‘path’ that forms the query and split it into a list of tuple that form key-value pairs using the zip() function. It needs to be a tuple rather than a dictionary because it’s possible that a the query could contain two of the same keys. The url could be /members/id/1/second_name/Smith/id/20. If it were a dictionary, the id 20 would override id 1.

Then there’s a function that tests the parameters are correct and throws a ‘Bad request 400′ error if, for example, a non-number has been passed as an id or one of the keys isn’t in the list of possible keys.

I then create an empty filter list and loop over the list of tuples I created from the url query. If it’s an id that has been passed, I split it into a list (ids can be in csv format) using the in_() method. Otherwise I loop over the other values (which can also be comma separated).

This time I use the __dict__ method of Member using the ‘key’ in the tuple as they key for the dictionary. Then ilike which is essentially a LIKE SQL condition.

Finally these rules are all joined together with the or_ method and filters is passed in with the asterisk to denote that it’s an argument list (rather than just one argument). You’ll notice I had to import or_ from sqlalchemy.sql.expression which was quite a revelation for me. Not surprisingly, it means that all the filters are joined with an OR condition which is what we want in this case.

Anyway, hope that helps someone. If I could have done it better, please let me know in the comments. This query business has a long way to go in my app development but this is a good start.

Categories: Flask

Tags: , ,

Be the first to comment

Flask vs Django

Deciding how you are going to put your website together is clearly an important decision. You generally need to make that decision up front and it will go on to haunt you for some time. As I’ve already mentioned I’ve decided to go with Python rather than PHP, but I’ve also chosen to use the Flask web development framework.

Generally speaking I’m firmly in the pro-framework camp. Any added ‘bloat’ that they add is firmly offset by the time one saves. Moreover, there’s a certain amount of forethought that goes into any half decent framework that the developer greatly benefits from.

I’ve fiddled around the edges with a range of frameworks for a while now. I think I’ve done the Django tutorial three or four times now. I’ve also had a tinker with Web2py, Bottle and Flask and have ultimately ended up really loving Flask.

But firstly, why not Django? I really like Django, it’s hard not to. It’s full MVC which works really well for me having used a few MVC frameworks in PHP. It also just takes care of all sorts of things like models and users/logins out of the box which is a massive plus.

But for this project I basically want to build an API, at least on the server/Python side of things. Javascript will take care of the front end, I just need a web app that outputs JSON.

In the end I settled on Flask. From when I first started looking at it to now it really seems to have taken off. The other micro-frameworks seem dated in comparison (not sure if that’s actually true). It just gets out of your way and lets you crack on with things. You can structure your app however you want and there are plenty of addons for it – for example SQLAlchemy which certainly makes database management a lot easier.

It’s also just super quick to throw something together with. When I wanted to put some photos on a map after a holiday in Scotland, it took me a few hours to throw it together (code here). I wanted to do a few things with Soundcloud like add filtering – a few hours later a friend and I had build a little app that meant we could filter by track duration amongst other things (code here).

The biggest downside though is that the documentation for all these plugins (and even Flask itself sometimes) is quite poor and has considerable gaps. Which is one of the reasons I’m writing this blog – I’ll share what I find difficult to find the answer to in the hope that it helps someone else.

So there you have it. I’d like to know more about Python frameworks. Are there MVC alternatives to Django for example? What are the other big players out there?

Categories: Flask, Python

Tags: , , , , , , ,

Be the first to comment

Python vs PHP

One of the reasons I thought I’d start this blog is because I’m using the Your Voice in House project to teach myself a few new tricks. As I’m sure any developer will tell you, having a little project is the best way to learn.

I’ve been coding since the early 2000s. Self taught, my primary programming language has always been PHP although I added Javascript to my tool-belt early on in the piece (when jQuery was a tiny baby and there was much deliberation about whether it was worth adding that much overhead to your site). When I first started using PHP it was version 4 and I remember feeling a little put out when I had to get my head around this Object Oriented stuff they introduced in 5. How far I have come.

PHP has plenty of detractors and there is no doubt, like every programming language, it has its short comings. But I’m not a PHP naysayer and I don’t want this post to get into the ‘is PHP even worthy of being called a programming language’ debate. Most anti-PHP sentiment is out of date these days anyway. The language has come a very long way – particularly in the last 5 years or so.

For me, particularly as someone learning to program 10 or 12 years ago, the greatest thing about PHP was the LAMP stack. It just makes everything easy. You put a file in a directory on the web root and when you navigate to that file, that file is run and you get something on the screen – including error messages. In addition, even the most basic hosting will already have a MySQL database and PHP just plugs right in there as well with a minimum of fuss. That is ideal when you’re just learning and don’t want to have to get your head around hosting environments before you get into the code.

The other fantastic thing about PHP is that it has a huge – and largely friendly – user base which means help is never far away and for the most part the documentation is excellent. I still visit just about every day.

But if you’ve been over to Github, you’ll notice that I’m primarily building this site in Python.

I’ve been playing around with Python for a few years now and thought I may as well use it for this project so I can get to know it a bit better. I’m still picking it up and just need to spend a decent amount of time with it to get to know it a bit better.

Python is fantastic. It’s very powerful, syntactically very pretty and it’s also much more of a general purpose language than PHP is. I know you can use PHP for more than just building websites but it just feels at home on a website. Python on the other hand is happy anywhere and I find myself using it more and more to automate processes on my laptop amongst other things.

Plus it makes a whole bunch of other things easier. I love the BeautifulSoup library that make scraping a website – which will be important for the YVIH site – so much easier. For the most part PHP is lacking that sort of thing. I know composer is changing that but I feel like that’s still a long way off what Python has to offer in that respect.

I don’t know if Python will ever be my primary language and I’m nervous about taking that step in may senses. I’m not yet game to use it for a freelance job and I do wonder if it will ever be my primary language at work.

Whatever the case, this blog will be focusing on Python (and some Javascript). Hopefully I’ll be able to share a few tips as I learn as well.

Categories: Coding, Python

Tags: , ,

1 comment

New Beginings

So, first posts hey. Hello World…

Many moons ago I founded the now defunct Australian Centre for Democracy and Justice. We were a small activist organisation that ran a range of campaigns and regular talks on a wide range of topics.

One of the campaigns we ran was ‘Your Voice in House’ which was a database of every politician in Australia. Users, could search for their local member or a certain minister and use the site to get in contact with them.

The biggest problem with the site was that it required a lot of maintenance. Politicians change contact details and portfolios all the time. Multiply that by 8 State and Territory governments plus a Federal government and it soon became unwieldy and ultimately I gave up. The process was manual and I had no real way of knowing if any of the data was out of date.

Fastforward quite a few years, I’m living in London, working as a web developer and still think it’s a good idea. So I start plugging away at it again.

So I’ve got a start on it, you can find the repository at Github (pull requests welcome), and I thought it probably wasn’t a bad idea to document the process.

There’s a lot of learning that I’m doing in a range of fields and here I am laying it bare in the hope that it’s vaguely interesting – maybe even helpful to some one.

Progress will be slow as I do this in my spare time and posts will probably be infrequent – are your expectations low enough yet?

Categories: Musing

Be the first to comment