Your Blog in House

Dealing with dodgy markup

Lately I’ve been working on scraping various parliamentary websites to collect all the data I need for the YVIH API.

I might write up how I’ve gone about this process in more detail in another post as there’s plenty to discuss. However today’s post is about dealing with dodgy markup.

Of course it would be really great if the websites of Australian government adhered to W3C standards. I can assure you that at the very least, both Queensland and the ACT’s ‘current members’ pages do not. In fact they both fail quite badly and are riddled with stray tags and other markup errors.

This is maddening. Government bodies need to be getting this sort of thing right. It’s an accessibility issue for both computers and people using screen reading devices. If it isn’t already, it should be a contractual requirement for anyone building a website for a government body that the work is standards compliant.

From my perspective this becomes an issue when trying to scape data. Beautiful Soup does a remarkable job of interpreting HTML and has relatively high degree of tolerance for malformed markup. However sometimes it doesn’t quite do what you want it to do and in my experience Beautiful Soup tends to assume that a tag has closed when it hits something it shouldn’t. This has the effect of not being able to find anything in the document or section from that point on.

After a bit of digging around though I found that Beautiful Soup supports a range of third party parses. One of them is html5lib which parses pages the same way a web browser does and creates valid HTML5. Perfect.

So I installed html5lib using pip:

pip install html5lib

Then instead of processing the page using Beautiful Soup with the standard library:

page = requests.get(link).content
soup = BeautifulSoup(page)

I tell it to use html5lib:

page = requests.get(link).content
soup = BeautifulSoup(page, "html5lib")

It worked for me on the Queensland members page and I’m hopeful it will work for the ACT as well. If not, there’s a couple of other parses that I can try.


Categories: Data

Tags: , ,

Be the first to comment

Building queries with Flask-SQLAlchemy

I guess it’s probably about time I wrote about a bit of code on this little blog of mine.

SQLAlchemy is a ‘Python SQL Toolkit and Object Relational Mapper’. As we know Flask is pretty barebones but there’s plenty of extensions including one for Flask-SQLAlchemy. And it’s pretty handy. When I first started playing around with this app I tried writing my own basic Model class that used the built in database functions for Flask. But it wasn’t long before all roads were pointing to SQLAlchemy – especially if I ever wanted to upgrade from SQLite at some stage with a minimum of fuss.

It’s powerful and allows you to create models easily and do just about everything you want straight out of the box.

The hard part though, it the rather limited documentation on the Flask-SQLAlchemy site. You’re therefore made to cross reference with the standard SQLAlchemy site where the implementation can be just different enough to be quite confusing.

Given that, I found it quite difficult to find figure out how to build a site with dynamically generated queries based on URL strings – you know, the sort that are central to APIs.

So I thought I’d just post my code here and step through it to explain what I’ve done:

from flask import Blueprint, abort
from yvih import db
from yvih.models import Member
from sqlalchemy.sql.expression import or_

....

@members_blueprint.route('/')
@members_blueprint.route('/<path:conditions>')
def members( conditions=None ):
    # if there are parameters
    if conditions != None:

        # get query and turn it into a dictionary
        conditions = conditions.split('/')
        query = zip(conditions[0::2], conditions[1::2])

        # check that all fields are valid
        if not parameter_accepted(query):
            abort(400)

        # build query
        filters = []
        for conditions in query:
            if conditions[0] == 'id':
                filters.append(Member.id.in_(conditions[1].split(',')))
            else:
                for term in conditions[1].split(','):
                    filters.append(Member.__dict__[conditions[0]].ilike('%' + term + '%'))
        members = Member.query.filter( or_(*filters) )
...

I’m obviously using Blueprints here but more about that later.

The important bit is the query building. Firstly I take the ‘path’ that forms the query and split it into a list of tuple that form key-value pairs using the zip() function. It needs to be a tuple rather than a dictionary because it’s possible that a the query could contain two of the same keys. The url could be /members/id/1/second_name/Smith/id/20. If it were a dictionary, the id 20 would override id 1.

Then there’s a function that tests the parameters are correct and throws a ‘Bad request 400′ error if, for example, a non-number has been passed as an id or one of the keys isn’t in the list of possible keys.

I then create an empty filter list and loop over the list of tuples I created from the url query. If it’s an id that has been passed, I split it into a list (ids can be in csv format) using the in_() method. Otherwise I loop over the other values (which can also be comma separated).

This time I use the __dict__ method of Member using the ‘key’ in the tuple as they key for the dictionary. Then ilike which is essentially a LIKE SQL condition.

Finally these rules are all joined together with the or_ method and filters is passed in with the asterisk to denote that it’s an argument list (rather than just one argument). You’ll notice I had to import or_ from sqlalchemy.sql.expression which was quite a revelation for me. Not surprisingly, it means that all the filters are joined with an OR condition which is what we want in this case.

Anyway, hope that helps someone. If I could have done it better, please let me know in the comments. This query business has a long way to go in my app development but this is a good start.


Categories: Flask

Tags: , ,

Be the first to comment

Flask vs Django

Deciding how you are going to put your website together is clearly an important decision. You generally need to make that decision up front and it will go on to haunt you for some time. As I’ve already mentioned I’ve decided to go with Python rather than PHP, but I’ve also chosen to use the Flask web development framework.

Generally speaking I’m firmly in the pro-framework camp. Any added ‘bloat’ that they add is firmly offset by the time one saves. Moreover, there’s a certain amount of forethought that goes into any half decent framework that the developer greatly benefits from.

I’ve fiddled around the edges with a range of frameworks for a while now. I think I’ve done the Django tutorial three or four times now. I’ve also had a tinker with Web2py, Bottle and Flask and have ultimately ended up really loving Flask.

But firstly, why not Django? I really like Django, it’s hard not to. It’s full MVC which works really well for me having used a few MVC frameworks in PHP. It also just takes care of all sorts of things like models and users/logins out of the box which is a massive plus.

But for this project I basically want to build an API, at least on the server/Python side of things. Javascript will take care of the front end, I just need a web app that outputs JSON.

In the end I settled on Flask. From when I first started looking at it to now it really seems to have taken off. The other micro-frameworks seem dated in comparison (not sure if that’s actually true). It just gets out of your way and lets you crack on with things. You can structure your app however you want and there are plenty of addons for it – for example SQLAlchemy which certainly makes database management a lot easier.

It’s also just super quick to throw something together with. When I wanted to put some photos on a map after a holiday in Scotland, it took me a few hours to throw it together (code here). I wanted to do a few things with Soundcloud like add filtering – a few hours later a friend and I had build a little app that meant we could filter by track duration amongst other things (code here).

The biggest downside though is that the documentation for all these plugins (and even Flask itself sometimes) is quite poor and has considerable gaps. Which is one of the reasons I’m writing this blog – I’ll share what I find difficult to find the answer to in the hope that it helps someone else.

So there you have it. I’d like to know more about Python frameworks. Are there MVC alternatives to Django for example? What are the other big players out there?


Categories: Flask, Python

Tags: , , , , , , ,

Be the first to comment

Python vs PHP

One of the reasons I thought I’d start this blog is because I’m using the Your Voice in House project to teach myself a few new tricks. As I’m sure any developer will tell you, having a little project is the best way to learn.

I’ve been coding since the early 2000s. Self taught, my primary programming language has always been PHP although I added Javascript to my tool-belt early on in the piece (when jQuery was a tiny baby and there was much deliberation about whether it was worth adding that much overhead to your site). When I first started using PHP it was version 4 and I remember feeling a little put out when I had to get my head around this Object Oriented stuff they introduced in 5. How far I have come.

PHP has plenty of detractors and there is no doubt, like every programming language, it has its short comings. But I’m not a PHP naysayer and I don’t want this post to get into the ‘is PHP even worthy of being called a programming language’ debate. Most anti-PHP sentiment is out of date these days anyway. The language has come a very long way – particularly in the last 5 years or so.

For me, particularly as someone learning to program 10 or 12 years ago, the greatest thing about PHP was the LAMP stack. It just makes everything easy. You put a file in a directory on the web root and when you navigate to that file, that file is run and you get something on the screen – including error messages. In addition, even the most basic hosting will already have a MySQL database and PHP just plugs right in there as well with a minimum of fuss. That is ideal when you’re just learning and don’t want to have to get your head around hosting environments before you get into the code.

The other fantastic thing about PHP is that it has a huge – and largely friendly – user base which means help is never far away and for the most part the documentation is excellent. I still visit php.net just about every day.

But if you’ve been over to Github, you’ll notice that I’m primarily building this site in Python.

I’ve been playing around with Python for a few years now and thought I may as well use it for this project so I can get to know it a bit better. I’m still picking it up and just need to spend a decent amount of time with it to get to know it a bit better.

Python is fantastic. It’s very powerful, syntactically very pretty and it’s also much more of a general purpose language than PHP is. I know you can use PHP for more than just building websites but it just feels at home on a website. Python on the other hand is happy anywhere and I find myself using it more and more to automate processes on my laptop amongst other things.

Plus it makes a whole bunch of other things easier. I love the BeautifulSoup library that make scraping a website – which will be important for the YVIH site – so much easier. For the most part PHP is lacking that sort of thing. I know composer is changing that but I feel like that’s still a long way off what Python has to offer in that respect.

I don’t know if Python will ever be my primary language and I’m nervous about taking that step in may senses. I’m not yet game to use it for a freelance job and I do wonder if it will ever be my primary language at work.

Whatever the case, this blog will be focusing on Python (and some Javascript). Hopefully I’ll be able to share a few tips as I learn as well.


Categories: Coding, Python

Tags: , ,

1 comment

New Beginings

So, first posts hey. Hello World…

Many moons ago I founded the now defunct Australian Centre for Democracy and Justice. We were a small activist organisation that ran a range of campaigns and regular talks on a wide range of topics.

One of the campaigns we ran was ‘Your Voice in House’ which was a database of every politician in Australia. Users, could search for their local member or a certain minister and use the site to get in contact with them.

The biggest problem with the site was that it required a lot of maintenance. Politicians change contact details and portfolios all the time. Multiply that by 8 State and Territory governments plus a Federal government and it soon became unwieldy and ultimately I gave up. The process was manual and I had no real way of knowing if any of the data was out of date.

Fastforward quite a few years, I’m living in London, working as a web developer and still think it’s a good idea. So I start plugging away at it again.

So I’ve got a start on it, you can find the repository at Github (pull requests welcome), and I thought it probably wasn’t a bad idea to document the process.

There’s a lot of learning that I’m doing in a range of fields and here I am laying it bare in the hope that it’s vaguely interesting – maybe even helpful to some one.

Progress will be slow as I do this in my spare time and posts will probably be infrequent – are your expectations low enough yet?


Categories: Musing

Be the first to comment