
Transport@UCL


Blackwall Tunnel

How careful, high-level testing can help remove the effects of outliers from large datasets


Large amounts of data are becoming available through many technologies such as ANPR (automatic number-plate recognition). In particular, the travel time of a vehicle can be estimated from the time that elapses between when it enters and leaves an area, by matching the plate images at suitable boundary points: the large volumes of data available then enable the journey time distribution to be estimated.
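As a minimal sketch of this matching idea (my own illustration, with made-up plates and timestamps, not a real ANPR system), each exit sighting can be paired with the latest earlier entry sighting of the same plate, and the journey time taken as the difference:

```python
from collections import defaultdict

# Hypothetical ANPR records: (plate, time in seconds) observed at the
# entry and exit boundary points. All values are invented for illustration.
entry_obs = [("AB12CDE", 10.0), ("XY34FGH", 25.0), ("AB12CDE", 1000.0)]
exit_obs = [("AB12CDE", 195.0), ("XY34FGH", 430.0), ("AB12CDE", 1180.0)]

def match_journeys(entry_obs, exit_obs):
    """Pair each exit sighting with the latest earlier entry of the same plate."""
    entries = defaultdict(list)
    for plate, t in entry_obs:
        entries[plate].append(t)
    for ts in entries.values():
        ts.sort()
    journeys = []
    for plate, t_out in sorted(exit_obs, key=lambda rec: rec[1]):
        ts = entries.get(plate, [])
        # Take the latest entry strictly before this exit, then consume it
        # so the same sighting is never matched twice.
        earlier = [t for t in ts if t < t_out]
        if earlier:
            t_in = earlier[-1]
            ts.remove(t_in)
            journeys.append((plate, t_out - t_in))
    return journeys

print(match_journeys(entry_obs, exit_obs))
# → [('AB12CDE', 185.0), ('XY34FGH', 405.0), ('AB12CDE', 180.0)]
```

A real deployment would also have to cope with misread plates and unmatched sightings, which is exactly where the spurious matches discussed below come from.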

Having encountered widely varying matched journey times when analysing ANPR data over distances of a kilometre or more (where drivers might quite reasonably have interrupted their journeys), Professor Ben Heydecker chose the more controlled environment of the Blackwall Tunnel: only one way in, only one way out and no stopping for newspapers on the way.

Figure: the three fitted components of the travel time distribution: (1) normal journeys, (2) slow journeys and (3) spurious matches. The composite distribution consists of a weighted combination of the three components.



Even so, when the journey time data were analysed, an initial attempt to fit a single log-normal distribution showed a large discrepancy in variation. So Ben introduced additional components until the law of diminishing returns set in. At this point something interesting appeared: a small number of the records seemed to be associated with a component whose widely spread distribution included some implausibly short journey times and some implausibly long ones. Ben reckons that these most likely arise from spurious matches or other miscellaneous data errors; some of the longer journeys could in fact relate to a vehicle getting stuck.

His conclusion? By removing the component that represents these extreme observations, one can filter out the effects of the spurious data and so achieve a more reliable picture of typical journey times through the tunnel. This means filtering the population rather than individual records. The result here is that instead of a standard deviation of 61 seconds (as indicated by the raw data), Ben suggests the figure is nearer 30 seconds for normal journeys.
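The fit-then-filter idea can be sketched as follows. This is my own illustration on synthetic data, not Prof. Heydecker's actual analysis: a plain EM fit of a three-component Gaussian mixture to the logarithms of the journey times (a log-normal mixture in the original scale), after which observations most likely to belong to the widely spread component are dropped before summary statistics are computed. All numerical values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic journey times in seconds (illustrative values, not the
# Blackwall Tunnel data): normal journeys, slow journeys, and a widely
# spread component standing in for spurious matches.
times = np.concatenate([
    rng.lognormal(np.log(180), 0.15, 900),
    rng.lognormal(np.log(300), 0.20, 80),
    rng.lognormal(np.log(250), 1.00, 20),
])
x = np.log(times)  # log-normal in time <=> normal in log-time

def em_fit(x, k=3, iters=300):
    """Plain EM for a one-dimensional Gaussian mixture on log times."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    sig = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each observation
        # (the constant 1/sqrt(2*pi) cancels in the normalisation).
        d = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / sig
        r = d / d.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and spreads.
        n = r.sum(axis=0)
        w, mu = n / len(x), (r * x[:, None]).sum(axis=0) / n
        sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
    return w, mu, sig, r

w, mu, sig, r = em_fit(x)

# Filter the population, not individual records: drop observations whose
# most likely component is the widely spread one (largest sigma in log
# scale), taken here to represent the spurious matches.
spurious = sig.argmax()
kept = times[r.argmax(axis=1) != spurious]
print(f"raw sd: {times.std():.0f}s, filtered sd: {kept.std():.0f}s")
```

The key design point is that no record is inspected or rejected individually: the mixture model assigns each record a probability of belonging to the spurious component, and the filtering acts on that population-level assignment.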

The moral of the story? Whilst "big data" offers huge opportunities, there isn't generally either the resource or the opportunity to inspect individual records, which means that more sophisticated methods are needed to identify and remove the effects of gremlins.


Ben Heydecker
is an expert in mathematical statistical analysis, modelling and evaluation, and has worked extensively with data on traffic flow, travel time, road accidents and casualties.