Guest Post: a final word on At War with Data

17 January 2017

Mark Roeling studied psychology (BSc), behaviour genetics (MSc) and genetic epidemiology (MSc) and worked as a junior researcher at the Erasmus MC Rotterdam (Netherlands). He worked as a data scientist for Capgemini, where he focused on detecting fraud in e-channels in the banking sector. Now at the University of Oxford, his work has been strongly statistical and he aims to use his background to test the applicability of methods and models from (genetic) epidemiology to improve the detection of fraud and cybercrime in big data.

As part of my DPhil in Cyber Security at the University of Oxford, I focus on understanding and analysing cyber security data, as well as developing methods to detect anomalies. Currently, we are developing imputation algorithms for variables in networked data. These would allow a profile to be created for any person in a network based solely on the data of linked peers, without requiring any data from the person in question.

Given that strategic advantages, both during war and peacetime, have always been dependent on the availability of high quality intelligence, and that more transactions are taking place online (in cyberspace) than ever before, having the ability to collect, spread and properly analyse significant amounts of data is paramount. Therefore, At War with Data in the Data Dialogue series immediately kindled my enthusiasm, as did the opportunity to meet Thomas Rid, whose papers we discussed as part of our doctoral training.

“How open should we be with scientific reporting if adversaries can use this information against us?”

The first inspiring lecture, from Prof. Charlotte Roueché, illustrated how archaeologists were useful in creating and interpreting maps during the war, since they had actually been there physically. This mattered all the more because, despite the existence of accurate maps of Europe, some parts of Africa (e.g. Libya) and the Middle East were poorly documented or inaccurately mapped, since the location of borders was, and still is, a politically charged topic. The website http://www.oldmapsonline.org reveals the substantial variation that has occurred in borders between countries.

In the cyber domain, this problem seems parallel to unclear definitions of Internet borders and related technologies. Attributing online attacks is difficult partly because an IP address in itself has limited use if the attack originated from a geographical location other than the country where the server was hosted. One take-home question was the extent to which researchers should openly share their data. In the context of archaeology, should we strive towards openly sharing the location of discovered religious places? Similar problems exist in other fields: should we publish medical data online to allow sharing and consortia? Can we openly share and publish new (e.g. zero-day) vulnerabilities of sensitive (high-integrity) systems? Essentially, how open should we be with scientific reporting if adversaries can use this information against us? This remains a difficult question to answer, since our ability to identify sensitive information also depends on our ability to judge how creative and knowledgeable adversaries can be. The truth is that a great deal of literature is already available in Computer Science, Engineering and Medicine (e.g. on the airborne transmission of the avian influenza A/H5N1 virus) that, with the right minds and tools, can be used to construct powerful attacks.

The second lecture, from Prof. Kate Bowers, provided valuable insight into the usefulness of GPS data to understand and visualise the movements of offenders. This reminded me of a study done by Deloitte which geographically maps offences in Rotterdam (the Netherlands) and uses those predictions in planning police (helicopter) surveillance activities. I did not know her work but became interested in a commendable chapter by Kate Bowers and Shane Johnson regarding criminal mapping in The Handbook of Security. Apparently, criminal mapping is acknowledged to be a valuable method in crime fighting.

From a statistical point of view, however, it is still doubtful whether the regression model used in the GPS study to predict position can accommodate dependencies that might exist between people moving from one location to another: are these GPS observations truly independent? I had similar questions when I attended a symposium in Oxford concerning Digital Wildfires by Helena Webb and Marina Jirotka, who presented an interesting study on the characteristics of Twitter messages. Are Twitter responses independent? Regression models are usually robust against slight violations of assumptions, but I nevertheless find it inspiring that the technical data collected nowadays still requires the development of statistical methods to ensure valid inferences.

The last two topics, from Prof. Robert Steward and Prof. Thomas Rid, were also engaging. Sharing medical information is difficult given the sensitivity of the data, but also necessary to allow large collaborations to find small and subtle effects that could not be found in small-scale analyses. Using an anonymised health record is an interesting perspective, because so far I had only considered a new polymorphic encryption and pseudonymisation method published by Eric Verheul and colleagues. Prof. Rid discussed, with a case study, how phishing attacks could influence national elections. The presented material was of good quality and Prof. Rid is an excellent speaker. These types of spamming attacks are common, and given my previous brief experience in industry, the sophistication and consequences of phishing attacks come as no surprise.

Overall, The Data Dialogue: At War with Data was an afternoon well spent, with high quality research presented, encouraging me and others to strive towards understanding some of the remaining questions in our future work.