Author: Jan Camenzind Broomby
For the purposes of my King’s Undergraduate Research Fellowship I assisted both Professor Claudia Aradau (KCL War Studies) and Professor Tobias Blanke (KCL Digital Humanities) in their research on intelligence and oversight in the UK parliamentary archives, with a focus on how questions of accountability and scrutiny have been raised in regard to the British intelligence activities. I was primarily tasked with cleaning the data that had been compiled thus far and conducting archival research to ensure that said data was relevant and correct. This involved harmonising the names of entities (the data that had been collected previously) under new names, discarding irrelevant entities, adding latitudes/longitudes for entities that referred to locations, and Wikipedia articles for entities that referred to events. This blog will discuss the methodology I adopted during my work and for the sake of clarity I have split this into several segments, each of which outlines a challenge that I encountered, and a solution I developed thereto.
How to begin?
I began by working on cleaning the data in the smallest Excel file that had been sent to me. This ensured that any mistakes I might make would be contained to a smaller, more manageable Excel file and would thus be easier to rectify. It also ensured that the task appeared less daunting and enabled me to get to grips with what I was doing without being overwhelmed. From the sheer size of the data that I was dealing with, I knew that I would have to automate a portion of the work in order to ensure that I was able to work as efficiently as possible. I therefore began looking into the different Excel commands that might be of benefit to me and discovered that the IFS command could be used to automatically fill cells with the harmonised names that I had chosen. Using this command, I set about populating the ‘Harmonised’ column with the harmonised names for each entity. Having done this I wrote another IFS command that filled another column with relevant Wikipedia articles depending on the harmonised name that the article was to refer to. Given the small size of this document I was able to write an IFS command that included all unique entities. This sped the progress of work up considerably.
World War or Wars?
One issue I stumbled across early on was the decision of how to harmonise certain entities. This problem was particularly prevalent when I came across several entities that referred generally to ‘World Wars’. Unsure of what to do, I emailed my supervisors and was advised that I could either harmonise this entity under a ‘World Wars’ name or create two separate rows for both World War I & II. Having realised that there was no Wikipedia article for ‘World Wars’, and that whenever a speech referenced ‘World Wars’, they also referred to World War I & II separately, I decided I would create new entries for both World War I & II whenever I came across the entity ‘World Wars’. I adopted the same solution for any other such issues I encountered, searching first to see whether a relevant Wikipedia article could be found, and secondly determining whether the entity could be split into smaller sub-events.
Having finished the file that focused on events, I moved onto the file that dealt with locations. I noted that this file was far larger and would therefore require more automation to finish. I gave a cursory glance to the whole dataset and noted down which entities seemed to be cropping up the most. I took this list and, using the IFS command, automated the harmonisation of these entities. While this was a step in the right direction, I nevertheless noticed that there were several entities that came up frequently that I had failed to make note of when I first looked through the document, these therefore had to be harmonised manually which took longer than it otherwise might have done.
Constituency or location?
The next issue I ran into was that a large portion of the locations that were included as entities were actually references to members of Parliament (who are often referred to by their constituency, rather than their names). These were unwanted as they generally did not refer to relevant locations, however the difficulty was that I could not be sure whether or not to discard the entity without checking each one within the Hansard archive.
I noticed also that the algorithm, when making reference to an MP, would often include their City, followed by one of the four cardinal compass directions (North, East, South, or West). The honourable member for Sunderland, South was, for example, often represented on the Excel sheet with one ‘Sunderland’ entity, and one ‘South’ entity following it. Given that there was a low chance that the South of Sunderland would be relevant to a GCHQ debate, and given that people colloquially tend to refer to areas with the compass direction first (one might say ‘the South of Sunderland’), rather than afterwards (one would not say ‘Sunderland, South’), I could thus remove pairs of entities that I knew followed this pattern of town, cardinal direction.
To read or not to read?
I then set about manually cleaning the remaining data but happened upon a further problem. When I went to check the Hansard archives to determine what was being referred to in the entities, I was often confronted with hundreds of different mentions of the entity. The entity might for example be listed as ‘West’ but upon searching within the Hansard archive I would find that ‘West’ had been referred to in the context of West Berlin, West Germany, The West, West Wales etc. and I could never be sure which of these ‘West’s was the one that should be recorded. The issue was that I had been checking the whole Hansard archive of each date, which meant that far more irrelevant results showed up.
I emailed my supervisors and they explained that the programme that had been used to extract entities from the Hansard archive material had only looked for locations, events and organisations in speeches that mentioned GCHQ, as well as 2/3 speeches before and after. This meant that I did not have to look through the entire Hansard archive. My supervisors also sent me a new filed named ‘gchq_speeches’. This massively sped up the process of manually checking entities as it allowed me to search within the gchq_speeches file, rather than searching on the Hansard archives website.
I finally moved on to the second location file. This was much larger than the two files I had been working on so far. It thus required more automation to ensure that I could get the work done on time. I knew that it made sense to automate the harmonisation of entities that appeared most frequently, but I had yet to figure out how I could determine which entities appeared the most. While I could have used Excel’s COUNTIF function, this would have involved me manually inputting each unique entity to see how many times it appeared and would thus not have saved me as much time.
I decided instead to use an online word occurrence tool. I quickly realised, however, that this tool counted each word separately. This caused an issue as many of the entities in my Excel file contained multiple words within one cell. I therefore found a different tool that would allow me to remove the spaces between the words in each cell. I could then take the output from the space removal tool and input that to the word frequency tool, which generated a list of the entities that most frequently came up in the file.
This list of the most frequent words allowed me to automate the harmonisation of entities using the IFS command in Excel more efficiently than I had been doing before. I still, however, was limited by the fact that Excel only allows 125 tests within one excel command in a cell. This meant that I was only able to automate the first 125 unique entities. However, this was by far the most efficient way of cleaning the data, and it enabled me to sort through over 5,000 data points considerably faster than I otherwise would have done.
As is evident from this piece it took me several iterations and adaptions to reach a point where I was satisfied with my workflow. However, having finished my undergraduate fellowship, I am aware that there are still numerous areas that could still be improved upon. I suppose that there are several things I have learned from this experience. It has borne testament to the value of trial and error, persistence, and to the fact that, as my supervisor Professor Blanke pointed out to me, ‘most knowledge is serendipitous in its origins to a degree’. Challenges should thus not be seen as an immovable obstacle, but as an opportunity to learn, and adapting and overcoming these challenges is an important part of learning and of the research process.
This article was written as part of the King’s Undergraduate Research fellowship held in the context of the GUARDINT project, which is supported by the Economic and Social Research Council.