Industrial Permit Processing: Towards a semantic map of Sheffield’s industrial polluters
As part of the AirQuality+ project, we are investigating the wider datasets associated with air quality in and around the city region. One promising avenue of research concerns the permits, issued to organisations, that govern the processing of pollutants. This information is already published in narrative form via Sheffield City Council’s public register. We’re working closely with the council on this dataset, and one common interest is a visual map of permits in the region.
Getting hold of, and maintaining, the data for this mapping exercise raises a number of interesting issues in itself. Information managers tend to think of data processes as straight lines from authoritative source to data publisher. The reality, however, seems much closer to the Usenet news ‘pouring water on a table’ metaphor. Information leaks out in many ways, and people will use whatever they can get in a variety of interesting and innovative ways.
We’re exploring this by trying to find the most efficient way to bring this data together. On one line of attack, we’re getting information directly from the relevant body and collating it into a single document.
Another line of attack is to take the information published on the council website and attempt to automate the extraction of meaningful information from the web pages and the documents themselves. Our first efforts at this can be seen here in a CSV file containing a list of all the A2 and B processes, along with whatever we have been able to extract. The extraction code is here [pretty it is NOT].
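To give a flavour of what this kind of extraction involves, here is a minimal Python sketch. It is not our actual extraction code, and it rests on assumptions: that the register page lists each permit as a link to a PDF, and that the link text ends with the process type in brackets. The names PermitLinkParser, extract_permits, to_csv and the SAMPLE_HTML snippet (including the operators in it) are all invented for illustration. The real pages are likely to be messier than this.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Invented sample markup; the real register page will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/permits/example-works.pdf">Example Works Ltd (Part B)</a></li>
  <li><a href="/permits/sample-foundry.pdf">Sample Foundry (A2)</a></li>
  <li><a href="/permits/placeholder-coatings.pdf">Placeholder Coatings</a></li>
  <li><a href="/contact">Contact us</a></li>
</ul>
"""

# Assumed link-text convention: "<operator> (A2)" or "<operator> (Part B)".
TYPE_PATTERN = re.compile(r"^(?P<operator>.+?)\s*\((?:Part\s+)?(?P<kind>A2|B)\)$")

FIELDS = ["operator", "process_type", "document_url"]


class PermitLinkParser(HTMLParser):
    """Collect (link text, href) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_permits(html):
    """Return one dict per PDF link, keeping rows we cannot fully parse."""
    parser = PermitLinkParser()
    parser.feed(html)
    rows = []
    for text, href in parser.links:
        if not href.lower().endswith(".pdf"):
            continue  # skip navigation and other non-document links
        match = TYPE_PATTERN.match(text)
        if match:
            operator, kind = match.group("operator"), match.group("kind")
        else:
            operator, kind = text, ""  # process type unknown: leave blank
        rows.append(
            {"operator": operator, "process_type": kind, "document_url": href}
        )
    return rows


def to_csv(rows):
    """Serialise the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_permits(SAMPLE_HTML)))
```

Rows that do not match the assumed naming convention are kept with a blank process type rather than dropped, in the same spirit as recording whatever we have been able to extract.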
There are some key lessons to learn about the scraping process that might provide illuminating feedback to the council.