Data science and us - part I
Updated: Apr 25, 2020
Data science seems to have become a catch-all phrase for what is actually many kinds of skills - some related and some disparate. Each skill has value on its own, but combined they can be more than the sum of their parts. Properly applied, with an understanding of the entirety of the problem, data science can help you make decisions to achieve the outcomes you want. In this post, we attempt to break down the field as we see it, define the three broad areas of expertise in it, and explain how we are different to other data science ventures.
This is part I of a three-part series, where we will break down the skills used in the data science field, explain where they are useful, and point out their limitations.
Domain experts can come from an academic or an industry background, or they can have experience in both. The value of each type of experience varies from industry to industry. Traditionally, domain experts have been called upon to make decisions on a wide range of things such as:
Can we safely increase the load rating on a dragline?
Should our managed fund buy a particular stock or not?
What fire danger rating should we broadcast for tomorrow?
These decisions are made using both sophisticated analyses of small datasets and some form of educated guess. In cases where data is difficult to find or the system is simple and well-understood enough to model analytically, a domain expert typically represents the best chance of achieving a good outcome.
Modern industries are increasingly data-driven. This push to lean heavily on data to influence decision-making has come about mainly due to two developments:
Modern electronics, IT and networking have enabled the collection of massive amounts of data for relatively little money. We are now creating more data per year than ever before. In 2018, IBM estimated that 90% of the data on the internet had been created in the last two years. Between 2014 and 2017, mobile device data quadrupled from 2 exabytes to 8 exabytes (an exabyte is 1 billion GB). The figure below shows both the growth of overall data and also the increasing percentage of what is known as unstructured data - data that cannot be easily fit into a table format. This is the sort of data that has hitherto been very difficult to handle.
Competition has forced companies to run leaner and optimise further. Competition from countries with low labour costs has forced manufacturing industries to increase automation, saving costs and improving quality. Competition between cell phone manufacturers has led to innovative uses of materials to optimise phone packaging.
Managing streams of unstructured data from generation to final use is the job of a data engineer. Data pipelines have to be created to populate databases. Databases have to be maintained and data quality and integrity assured. These databases can then be queried to extract the relevant records for data analysts to visualise via a BI app such as Tableau.
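The ingest, validate, store, and query steps described above can be sketched in a few lines. This is only a minimal illustration, assuming an in-memory SQLite database and a handful of made-up sensor records; the table name `readings`, the field names, and the `load` helper are all hypothetical and stand in for whatever a real pipeline would use.

```python
import sqlite3

# Hypothetical raw records; one of them is missing its reading.
raw_records = [
    {"sensor": "pump-1", "reading": "12.5"},
    {"sensor": "pump-2", "reading": None},
    {"sensor": "pump-1", "reading": "13.1"},
]


def load(conn, records):
    """Insert records that pass a basic quality check; return the number rejected."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor TEXT NOT NULL, reading REAL NOT NULL)"
    )
    rejected = 0
    for rec in records:
        try:
            value = float(rec["reading"])  # fails for missing or non-numeric values
        except (TypeError, ValueError):
            rejected += 1
            continue
        conn.execute("INSERT INTO readings VALUES (?, ?)", (rec["sensor"], value))
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
rejected = load(conn, raw_records)

# The kind of aggregate query an analyst's dashboard might run.
rows = conn.execute(
    "SELECT sensor, AVG(reading) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows, "rejected:", rejected)
```

A production pipeline would add scheduling, logging, and schema management, but the shape is the same: reject or repair bad records on the way in so that downstream queries can be trusted.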
In today's world, any company that does not have a data engineer to ingest, store, maintain, and manipulate data will be left behind by those that use data to better serve their customers.
Having mountains of data serves little purpose without the ability to use this data to generate meaningful insights that drive business decisions. Sometimes the data tells its own story and gaining insights from it is simple. Often, however, particularly with large or multivariate datasets, a quick glance over a scatter plot or a bar chart is insufficient to help decide which part of the business is most worth spending time and money on. Sometimes the question is more about the data you don't have and what effect its absence has on your business. Statistical techniques can be of much use in these situations:
A well-designed A/B test can judge whether website design A or website design B has a higher conversion rate from visitor to customer.
Robust regression models can forecast future profits and enable early planning for capital-heavy changes.
P-values and t-tests can indicate whether the drug being tested has a statistically significant effect on the test subjects or whether any effects observed could be explained by chance.
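As a concrete illustration of the A/B testing case, here is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library. The visitor and conversion counts are made up for the example, and the function name `two_proportion_z_test` is our own; a real analysis would also decide on sample size and significance level before the test is run.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 5000 visitors convert on design A, 250 of 5000 on B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone; it does not by itself say whether the difference is large enough to matter to the business.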
The ability to statistically analyse and visualise large amounts of data is critical to making decisions that are robust to expected variations in markets, weather, or materials.
Bart Custers, H. Herik, Cees Laat, Michel Rademaker, and Cor Veenman. Enabling Big Data Applications for Security: Responsible by Design. Mar 2017.
Bernard Marr. How much data do we create every day? The mind-blowing stats everyone should read. Sep 2019. URL https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/c64dc360ba99
Jeff Schultz. Micro Focus blog. Jun 2019. URL https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day