Data Driven Business Solutions

We specialise in Data Science and AI solutions for complex engineering and business problems.

  • Rahul Rao

Data science and us - part II

Data science seems to have become a catch-all phrase for what is actually many kinds of skills - some related and some disparate. Every skill has value but when combined, they can be more than the sum of their parts. Properly applied, with an understanding of the entirety of the problem, data science is a powerful tool. In this post we attempt to break down the field as we see it, define its three broad areas of expertise, and explain how we're different to other data science ventures.

This is part II of a three-part series, where we explore the gains to be made by operating at the intersection of two of the three main data science skills. To read part I, click here.


Traditional decision making methods

Traditional decision making methods rely on domain knowledge and expertise in the statistical analysis of small sets of data. For simple systems, domain experts can write analytical models that predict system behaviour for a given set of inputs. These models can then be tested against data collected from the physical system. When collecting data from the full system is prohibitively expensive, a simplified or scaled-down system can be used instead, or historical data may be available. Examples of such applications are:

  • Running FEA simulations on a CAD model of an excavator bucket and comparing the predicted failure loads to those measured on an appropriately scaled model.

  • Predicting the global rise in temperature assuming a given emissions profile by modelling GHG production and heat retention.

  • Predicting the transmissibility of COVID-19 using chemical kinetic modelling (see our earlier blog post).
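To make the idea of an analytical model concrete, here is a minimal sketch of an epidemic model written in the style of chemical kinetics. It assumes a standard SIR-type model, where infection behaves like a bimolecular reaction (S + I -> 2I) and recovery like first-order decay (I -> R). This is our illustration only and not necessarily the model used in the earlier post. The function name, the rate constants `beta` and `gamma`, and the population size are all hypothetical values chosen for demonstration, not fitted to any data.

```python
def simulate_epidemic(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate an SIR-type model with forward Euler steps.

    Infection is treated like a bimolecular reaction (S + I -> 2I, rate beta)
    and recovery like first-order decay (I -> R, rate gamma).
    Returns a list of (day, S, I, R) tuples, one per time step.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        # Mass-action rate laws, scaled by the step size dt
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Illustrative (not fitted) parameters: beta / gamma = 2.5
history = simulate_epidemic(population=1_000_000, initial_infected=10,
                            beta=0.5, gamma=0.2, days=200)

# Find the time step with the largest number of simultaneous infections
peak_day, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Peak of {peak_infected:,.0f} infections around day {peak_day:.0f}")
```

A model like this can then be tested against observed case counts, which is the validation step described above.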

In the age before big data, traditional decision making was the only avenue open to businesses and it served its purpose. However, with massive datasets now available for many problems, traditional methods only scratch the surface of what can be done. To unlock the true power of the vast streams of data that many companies now have access to, traditional methods must be augmented with modern techniques.


Non-statistical decision making

Decisions where different outcomes are clearly separable and can be determined with a simple scatter plot or histogram can sometimes be made without detailed statistical analysis. However, for more nuanced or multivariate problems, a lack of statistical rigour introduces significantly more risk. In industries where the cost of failure is high, focusing on simple statistical measures is often not the optimal solution. Instead, outcome distributions must be carefully analysed to reduce the probability of failure. Examples of these are:

  • Equipment manufacturers do not seek to maximise time before hardware failure on their products; instead they seek to minimise the number of products that fail before their end-of-life. Although the outcome of interest (product longevity) is the same, the resultant actions may be very different.

  • Predictive maintenance solutions do not optimise for highest accuracy; instead they seek to minimise unplanned downtime. In either case the outcome of interest is the time when maintenance is due, but optimising for accuracy alone leaves approximately 50% of maintenance overdue, because an unbiased prediction is about as likely to be late as early. Statistical rigour reduces this to an acceptable level.
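The predictive maintenance example can be sketched in a few lines. The sketch assumes that an accuracy-optimised model predicts something close to the median failure time, and it compares scheduling maintenance at that median against scheduling at a low percentile of the failure-time distribution. The failure times are synthetic: the Weibull distribution, its parameters, the 5th-percentile target and the helper names are all assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic failure times in operating hours. The Weibull scale and shape
# are illustrative assumptions, not measurements.
historical = [rng.weibullvariate(1000.0, 3.0) for _ in range(5000)]
future = [rng.weibullvariate(1000.0, 3.0) for _ in range(5000)]


def empirical_quantile(values, q):
    """Return the value below which roughly a fraction q of the samples fall."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def overdue_fraction(failure_times, maintenance_time):
    """Fraction of units that fail before the scheduled maintenance time."""
    return sum(t < maintenance_time for t in failure_times) / len(failure_times)


# Schedule 1: maintain at the typical (median) failure time
median_schedule = empirical_quantile(historical, 0.50)
# Schedule 2: maintain early enough that only about 5% of units fail first
cautious_schedule = empirical_quantile(historical, 0.05)

overdue_at_median = overdue_fraction(future, median_schedule)
overdue_at_cautious = overdue_fraction(future, cautious_schedule)

print(f"Maintain at the median failure time: {overdue_at_median:.1%} overdue")
print(f"Maintain at the 5th percentile: {overdue_at_cautious:.1%} overdue")
```

The cautious schedule trades more frequent maintenance for fewer unplanned failures. Which percentile is acceptable depends on the relative cost of downtime and maintenance, which is a business decision rather than a statistical one.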

This region of decision making is potentially dangerous and is rarely suitable for business plans, particularly where failure is costly.


Machine learning methods

Applied statistics and mathematics, when combined with data engineering, lead to machine learning, an incredibly powerful tool for discovering patterns in data and determining what changes to make to achieve desired outcomes. Machine learning has been employed extensively and has produced some rather surprising results, which can be explained a posteriori.

  • Walmart found that strawberry pop-tart sales increased seven-fold before a hurricane [1]. Do people stock up on pop-tarts assuming they will be confined at home during the hurricane? Now Walmart knows that they should increase the supply of strawberry pop-tarts to meet increased demand in hurricane season.

  • Uber found that areas with large numbers of Uber trips also have higher crime rates [2]. What root cause leads to both observations being positively correlated? Could insights from Uber help police departments reduce crime?

Without domain knowledge or additional data, the questions above are difficult to answer. Furthermore, a lack of domain knowledge has led to some dangerous biases in machine learning algorithms:

  • Equally sick people have been assigned lower risk scores by a healthcare algorithm in the US on the basis of race [3]. Investigation found that this was because the algorithm used annual healthcare spending as a proxy for sickness. Someone with appropriate domain knowledge would have recognised the flaw in this assumption and either chosen a more appropriate input or corrected spending for race.

  • An Amazon machine learning algorithm used to filter out resumes when hiring has been found to discriminate unfairly against women [4]. Historical analysis of the types of resumes accepted showed a bias towards male applicants as historically most workplaces have been male-dominated. A domain expert would have realised this and corrected for it in the data input to the algorithm.
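The proxy problem in the healthcare example can be reproduced with a toy dataset. In the sketch below, two hypothetical groups have identical levels of sickness, but group B spends less per condition, for instance because of barriers to accessing care. Ranking patients by spending then under-selects group B for extra care, while ranking by the number of conditions does not. All of the numbers, field names and the 20% cut-off are invented for illustration and are not taken from the cited study.

```python
# Build a synthetic population: both groups have the same spread of sickness
# (0 to 9 chronic conditions, 10 patients at each level), but group B spends
# less per condition. The spending figures are hypothetical.
patients = []
for group, spend_per_condition in (("A", 1000.0), ("B", 700.0)):
    for conditions in range(10):
        for _ in range(10):
            patients.append({
                "group": group,
                "conditions": conditions,
                "spending": conditions * spend_per_condition,
            })


def flagged_share_of_group_b(patients, key, top_fraction=0.2):
    """Rank patients by `key` and return group B's share of the top slice."""
    ranked = sorted(patients, key=lambda p: p[key], reverse=True)
    flagged = ranked[: round(top_fraction * len(ranked))]
    return sum(p["group"] == "B" for p in flagged) / len(flagged)


# Risk score based on the proxy (spending) versus a direct measure of sickness
share_by_spending = flagged_share_of_group_b(patients, "spending")
share_by_conditions = flagged_share_of_group_b(patients, "conditions")

print(f"Group B share when ranking by spending:   {share_by_spending:.0%}")
print(f"Group B share when ranking by conditions: {share_by_conditions:.0%}")
```

Both groups are equally sick by construction, so a fair ranking should flag them in equal proportion. The gap between the two shares is introduced entirely by the choice of input.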

Machine learning is a double-edged sword - with inputs from human experts it can provide deep insights into how the world works; without them, it could lead a user down the wrong path and produce unintended and undesirable outcomes.


[1] Constance L. Hays. What Wal-Mart knows about customers' habits. Nov 2004.

[2] Lianne Yvkoff. Neighborhoods with more crime have more Uber rides. Sep 2011.

[3] Linda Carroll. Widely-used healthcare algorithm racially biased. Oct 2019.

[4] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. Oct 2018.
