
Data Driven Business Solutions

We specialise in Data Science and AI solutions for complex engineering and business problems.

  • Rahul Rao

Energy efficiency in buildings - the case for AI

This post draws heavily from a recent technical paper published by Deep Blue AI. You can find the paper here.

Deep Blue AI Technical Report


With the indisputable evidence of climate change growing, the world is investing heavily in renewable sources of energy to supplant fossil fuels. Solar, wind, wave, hydro and geothermal energies have all been posited as clean replacements for coal and oil. Already significant strides have been made to integrate such energy sources into our lives.

A solution to mitigate climate change that can operate in parallel to the transition to renewable sources is a reduction in energy usage. From decreasing fuel consumption in passenger vehicles to the adoption of energy-efficient technologies such as LED lighting, there is real movement from both industry and consumers to reduce their carbon footprint.

Energy consumption is generally split into three categories when reported out: industry, transport, and ‘other’. Given the reality that a significant fraction of the energy use in the ‘other’ category is in buildings for lighting, heating, and cooling, there are arguments that building energy use should become its own category. In the UK and the EU in 2004, building energy accounted for between 37% and 39% of all energy consumption.

This realisation has spawned many studies on building energy use (see here, here, here and here). These studies have historically been based on modelling the physics of heat transfer, efficiencies of HVAC systems, and historical usage patterns. Software packages such as eQuest, Ecotect, and EnergyPlus calculate building energy requirements based on parametric models with detailed data on both the proposed building and the local climate.

Such modelling techniques struggle when required to account for complex interactions between systems, systems with multiple modes of operation, and evolving usage patterns of buildings. To address these shortcomings, researchers have turned to artificial intelligence (AI) algorithms that are more suited to complex modelling.

These studies have largely been academic in nature, given the constantly evolving field of AI and the difficulty in obtaining sufficient quantities of real-time data. The recent growth in big data and internet-of-things (IoT) makes this a field likely to move from the research space to the application space in the near future.

This technical paper uses data from the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) to model energy usage and understand the parameters affecting it.


The ASHRAE data consists of three datasets:

  • Dataset A — building metadata, including building ID, site ID, primary use of building, gross floor area, year of build, and number of floors in the building.

  • Dataset B — average building energy usage rate at hourly intervals.

  • Dataset C — hourly weather conditions at each site, including air temperature, cloud coverage, dew temperature, precipitation, ambient pressure, wind direction, and wind speed.

These datasets have several thousand missing or implausible values. These values were imputed based on linear interpolation, values carried forward, or the median value, or were excluded entirely when long sequences of data were missing. Such occurrences are common when sensors fail or are absent, or system reporting protocols are changed.
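The three imputation strategies mentioned above can be sketched with pandas. This is only an illustration on a made-up temperature series; the series name, values, and timestamps are assumptions and do not come from the ASHRAE data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly air-temperature series for one site, with gaps (NaN).
temps = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
    index=pd.date_range("2016-01-01", periods=6, freq="60min"),
    name="air_temperature",
)

# 1. Linear interpolation between the neighbouring valid readings.
interpolated = temps.interpolate(method="linear")

# 2. Carry the last valid reading forward into the gap.
carried_forward = temps.ffill()

# 3. Replace every gap with the median of the observed values.
median_filled = temps.fillna(temps.median())

print(pd.DataFrame({
    "raw": temps,
    "interpolated": interpolated,
    "carried_forward": carried_forward,
    "median_filled": median_filled,
}))
```

Which strategy suits a given column depends on how quickly the quantity changes and how long the gap is, which is presumably why long missing sequences were excluded rather than filled.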

Each dataset is explored independently in the sections below.

Dataset A

There are 16 sites at which data has been measured. The number of buildings at each site is shown in Figure 1.

Figure 1. Distribution of buildings across sites.

These buildings are distributed across 16 primary use categories as shown in Figure 2.

Figure 2. Distribution of buildings across primary use categories.

The dataset includes buildings built across a wide range of years, from 1900 to 2018. Figure 3 shows this distribution; note that the dataset includes approximately 150 buildings that are less than 20 years old.

Figure 3. Distribution of buildings across year of build.

Buildings of varying sizes are represented here; Figure 4 shows the distribution between the smallest building (283 square feet) and the largest (875,000 square feet).

Figure 4. Distribution of buildings across square footage.

Dataset B

This dataset has meter readings from four meters: electricity, chilled water, steam, and hot water. In this investigation only the electric meter reading is considered. Site 0 was found to have a constant electric meter reading of 0.0 from 1 January 2016 to 20 May 2016. Given the length of time without a reading, these rows were dropped entirely. There were several missing values that were imputed based on linear interpolation or carrying forward. Some meter readings from building 1099 were implausible and were discarded; possible causes include a change of metering units, a meter fault, or corruption of the storage database.

The distribution of electric meter readings is highly positively skewed (skewness = 8.71), with a much longer tail on the right side of the distribution as indicated in Figure 5. Note the logarithmic scale on the y-axis.

Figure 5. Distribution of meter readings.

Dataset C

The weather dataset contains several measurements of weather at each site at hourly intervals. Several missing values exist in this dataset — failure of weather instruments, failure to record data, or the complete absence of weather instruments are possible reasons. These values were imputed based on linear interpolation.

Air temperature for site 11 is plotted against time in Figure 6 to ensure trends are as expected. Similar plots were inspected for different sites and the same trend observed.

Figure 6. Seasonal variation of air temperature at site 11.

Data Wrangling

Information from all three datasets is critical to the accurate modelling of energy usage and feature importance. In order to accurately match rows, the following left merges were applied:

  • Dataset B with Dataset A, on primary key Building ID, to create Dataset D.

  • Dataset D with Dataset C, with Site ID and Timestamp as a composite key, to create Dataset E.

Dataset E thus contains building metadata, weather information, and electricity meter reading. It is used for all following sections.

The merging of datasets is shown in Figure 7.

Figure 7. Merging of datasets to create the unified dataset E. Orange indicates primary keys used during merging.
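The two left merges can be sketched with pandas as follows. The tiny frames and all column names (building_id, site_id, timestamp, and so on) are assumptions made for illustration; the real datasets have many more rows and columns.

```python
import pandas as pd

# Tiny stand-ins for the three datasets; all column names are assumptions.
buildings = pd.DataFrame({            # Dataset A: building metadata
    "building_id": [0, 1],
    "site_id": [0, 1],
    "primary_use": ["Office", "Education"],
})
meters = pd.DataFrame({               # Dataset B: hourly meter readings
    "building_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-06-01 00:00", "2016-06-01 01:00", "2016-06-01 00:00"]
    ),
    "meter_reading": [120.0, 115.0, 80.0],
})
weather = pd.DataFrame({              # Dataset C: hourly weather per site
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-06-01 00:00", "2016-06-01 01:00", "2016-06-01 00:00"]
    ),
    "air_temperature": [18.0, 17.5, 22.0],
})

# Dataset D: every meter reading keeps its row and gains building metadata.
dataset_d = meters.merge(buildings, on="building_id", how="left")

# Dataset E: Dataset D gains the weather observed at the same site and hour.
dataset_e = dataset_d.merge(weather, on=["site_id", "timestamp"], how="left")

print(dataset_e)
```

A left merge keeps one row per meter reading even when the metadata or weather is missing, so checking that the row count is unchanged after each merge is a cheap sanity test.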

Machine Learning

Feature Engineering

Some obvious features that intuitively affect the energy usage in a building can be engineered into this dataset. For example, on weekdays during office hours, energy use in offices and schools should be higher than on weekends or after-hours. The opposite is true for residential buildings. In addition, the 24-hour clock does not make it clear to a machine learning algorithm that 23:59:59 and 0:00:00 are virtually the same. The hour feature is therefore converted to two features: the sine and the cosine of the hour, expressed as an angle over a 24-hour cycle.

Similar actions are performed on wind direction. The dataset contains wind direction with respect to true north. This is converted to two features: the sine and the cosine of the wind direction.
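The cyclic encoding described above can be sketched in a few lines of Python. The helper names below are hypothetical; the point is only that values close together on the clock or the compass end up close together in the encoded space.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as a (sine, cosine) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hour of day has a period of 24.
late = cyclic_encode(23.0 + 59.0 / 60.0, 24.0)   # 23:59
midnight = cyclic_encode(0.0, 24.0)              # 00:00
noon = cyclic_encode(12.0, 24.0)                 # 12:00

# Wind direction in degrees from true north has a period of 360.
north = cyclic_encode(0.0, 360.0)
almost_north = cyclic_encode(359.0, 360.0)

print("23:59 vs 00:00:", distance(late, midnight))
print("12:00 vs 00:00:", distance(noon, midnight))
print("359 deg vs 0 deg:", distance(north, almost_north))
```

Both the sine and the cosine are needed: either one alone maps two different hours (or directions) to the same value.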

The high skewness of the target variable — meter reading — was noted previously. Machine learning algorithms learn sub-optimally with highly skewed features or target variables. This target variable was transformed to log(1 + meter reading). The resultant distribution is shown in Figure 8; skewness is reduced from 8.71 to -0.055.

Figure 8. Distribution of feature-engineered meter reading.
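The effect of the log(1 + x) transform on skewness can be illustrated on a small made-up sample. The readings below are hypothetical, and the skewness definition used here (third central moment divided by the cubed standard deviation) is an assumption; the paper does not state which estimator produced the quoted values.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed std deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical meter readings with a long right tail.
readings = [0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 20.0, 60.0, 400.0, 3000.0]

# log(1 + x) is defined at x = 0, unlike log(x).
transformed = [math.log1p(r) for r in readings]

# The inverse transform recovers the original scale for reporting predictions.
recovered = [math.expm1(t) for t in transformed]

print("skewness before:", skewness(readings))
print("skewness after: ", skewness(transformed))
```

Because the model is trained on the transformed target, its predictions have to be passed through the inverse transform before they can be compared with raw meter readings.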

Test, Train, and Validation Splits

After cleaning, the dataset consists of 11.5 million rows. 30% of this (3.5 million rows) is removed to act as a holdout set. Of the remaining 8 million rows, 30% (2.4 million rows) are separated and are used as a validation set. The remaining 5.6 million rows are used as training data. This split is shown in the form of a Sankey diagram (Figure 9).

Figure 9. Data flow from original dataset to training, validation, and test datasets.

The training set is used to tune model parameters. The validation set is used as an early stopping criterion — when error on this set starts to increase, the model is overfitting.

Finally, when the trained model is evaluated, error on the test set is the preferred metric. This ensures the model is tested against data it has never been exposed to before.
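The two-stage split can be sketched as below. Applying the same fractions to 11.5 million rows gives roughly 3.45, 2.4, and 5.6 million rows for the holdout, validation, and training sets. The function name is hypothetical, and shuffling rows at random is an assumption; the post does not say whether the split was random or ordered by time.

```python
import random

def two_stage_split(n_rows, holdout_fraction=0.30, validation_fraction=0.30, seed=0):
    """Shuffle row indices, then carve off a holdout set and a validation set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Stage 1: remove the holdout (test) rows from the full dataset.
    n_test = round(n_rows * holdout_fraction)
    test, remainder = indices[:n_test], indices[n_test:]

    # Stage 2: remove the validation rows from what is left.
    n_val = round(len(remainder) * validation_fraction)
    validation, train = remainder[:n_val], remainder[n_val:]
    return train, validation, test

train, validation, test = two_stage_split(1_000)
print(len(train), len(validation), len(test))
```

Note that the second 30% is taken from the remainder, so the validation set is 21% of the original data and the training set 49%.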

Model Selection

This study has two purposes:

  • Modelling of energy usage — deep artificial neural networks have proven to be adept at modelling a variety of problems (see here, here, here, and here). Given the speed at which these algorithms can be trained and their incredible versatility, they would be a natural choice for this application. Another class of algorithms that has received some attention over the past few years is gradient boosted trees (see here, here, and here). These train somewhat slower but are capable of generating accuracy comparable to that of neural networks. Furthermore, they are particularly well-suited to categorical data.

  • Determination of parameters that most impact energy usage — explainable AI (XAI) is a hot topic these days as AI makes inroads into the legal and regulatory industries. Neural networks do not lend themselves to explainability due to their nested non-linear structures. Several studies have made pleas for the use of more transparent algorithms to build trust with the public. In this respect, tree-based models come out on top.

Given these two requirements, a gradient-boosted tree was chosen for this study.

Model Architecture

Given the large number of independent variables, 1280 leaves were chosen per tree. A large number of leaves such as this increases the chances of overfitting. To mitigate overfitting, L2 regularisation and feature subsampling are used. The L2 regularisation coefficient was set to 2.0, and 15% of features are randomly discarded at each tree node.

Four categorical columns are declared for the model:

  • Building ID

  • Site ID

  • Building primary use

  • Day of the week

The optimisation metric chosen is the RMS error of the target variable. Note that the target variable is log(1 + meter reading), as described under Feature Engineering.
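The post does not name the gradient-boosting library that was used. As a sketch only, the settings above could be written with LightGBM parameter names as follows; the mapping to those parameter names and the categorical column names are assumptions.

```python
# Assumed mapping of the settings described above onto LightGBM parameter
# names; the post does not say which gradient-boosting library was used.
params = {
    "objective": "regression",
    "metric": "rmse",                 # RMS error on log(1 + meter reading)
    "num_leaves": 1280,               # leaves per tree
    "lambda_l2": 2.0,                 # L2 regularisation coefficient
    "feature_fraction_bynode": 0.85,  # keep 85% of features at each node
}

# Hypothetical column names for the four categorical features.
categorical_features = ["building_id", "site_id", "primary_use", "day_of_week"]

for name, value in params.items():
    print(f"{name} = {value}")
```

Declaring identifiers such as the building and site IDs as categorical lets a tree-based model split on groups of IDs rather than treating them as ordered numbers.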


The gradient-boosted tree model is trained for 1000 epochs with a learning rate of 0.05. Early stopping based on validation set error is implemented with a patience of 50 epochs.
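Early stopping with a patience window can be sketched independently of any particular library. In the hypothetical helper below, the list of validation errors stands in for the value a real model would report after each epoch, and the synthetic curve is made up for illustration.

```python
def find_best_round(validation_errors, max_rounds=1000, patience=50):
    """Return (best_round, best_error, stopped_at) for a validation curve.

    Training stops once `patience` rounds pass without a new best error,
    or when `max_rounds` is reached, whichever comes first.
    """
    best_round, best_error = -1, float("inf")
    stopped_at = 0
    for round_index, error in enumerate(validation_errors[:max_rounds]):
        stopped_at = round_index
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_error, stopped_at

# Synthetic curve: error falls until round 120, then creeps upward.
curve = [1.0 / (1 + r) + 0.0005 * max(0, r - 120) for r in range(1000)]

best_round, best_error, stopped_at = find_best_round(curve)
print(best_round, best_error, stopped_at)
```

If the validation error is still improving when the round limit is reached, as reported for this study, the patience condition never triggers and the limit alone ends training.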

Figure 10 shows the evolution of RMS error for the training and the validation sets as training progresses. At 1000 epochs, there are still small gains to be made in validation set error, so it is possible that further accuracy improvements can be achieved; for the purposes of this study, training is halted here.

Figure 10. Loss history during training for training and validation datasets.

Training takes approximately 4 hours on an Intel Core i5 CPU without GPU acceleration.

Results and Discussion

The RMS error on the test set is calculated to be 0.2417. A large fraction of the meter readings are very low numbers, which heavily skews any percentage error calculation. For meter readings over 100 kWh, the mean absolute percentage error is approximately 9.5%, and 90% of predictions are below 20% error (see Figure 11). Given the wide range of buildings and climates covered, the wide range of meter readings modelled, and the frequency of missing values in the dataset, we consider this to be a promising result for further study.

Figure 11. Fraction of predictions below a given error threshold.
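One plausible way to compute the metrics quoted above is sketched here. The function names, the 100-unit cut-off argument, and the sample values are assumptions for illustration; the exact procedure used in the paper may differ.

```python
import math

def rmse_log(actual, predicted):
    """RMS error between log(1 + actual) and log(1 + predicted)."""
    squared = [
        (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))

def percentage_errors(actual, predicted, minimum_reading=100.0):
    """Absolute percentage error for readings above `minimum_reading`."""
    return [
        100.0 * abs(p - a) / a
        for a, p in zip(actual, predicted)
        if a > minimum_reading
    ]

def fraction_below(errors, threshold):
    """Share of errors below `threshold` (one point on a curve like Figure 11)."""
    return sum(e < threshold for e in errors) / len(errors)

# Hypothetical readings and predictions, in the meter's original units.
actual = [0.5, 150.0, 200.0, 400.0, 1000.0]
predicted = [2.0, 165.0, 190.0, 300.0, 1010.0]

errors = percentage_errors(actual, predicted)
mean_error = sum(errors) / len(errors)
share_within_20 = fraction_below(errors, 20.0)

print("RMS error on log scale:", rmse_log(actual, predicted))
print("mean percentage error above cut-off:", mean_error)
print("share of predictions within 20%:", share_within_20)
```

The first sample pair shows why low readings are excluded from the percentage calculation: predicting 2.0 for a reading of 0.5 is a 300% error even though the absolute miss is tiny.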

The second objective of this study is to determine the importance of different features to the energy use of a building. Figure 12 shows the top ten features, ranked by importance.

Figure 12. Modelled feature importance for the top ten features.

Building ID is, by a small margin, the most important determinant of energy use. Although this is meant to be simply a unique identifier for each building, it clearly encodes important properties of the building that are distinct from year of build, square footage, and number of floors. Some candidates that come to mind are building orientation and surroundings, insulation, ventilation, lighting technology, appliance efficiency, and building materials. Incorporating some of these parameters would help reduce the black-box nature of Building ID and give a clearer picture of the factors affecting energy use.

Air temperature and dew temperature are both also significant determinants of energy use. These are not surprising as heating and cooling form a large fraction of building energy requirements.

Sea level pressure is another important factor. The reasons for this are outside the scope of this study, although they could be investigated further using the dataset. Possible interpretations are increased heating needs at higher altitudes, or the correlation between low barometric pressure and changing weather, which could necessitate switching between heating and cooling.

Wind direction and speed are surprisingly also high on the list — the buildings in the dataset are all in the northern hemisphere where northerly winds are likely to be colder and southerly winds to be warmer. The preponderance of buildings greater than fifty years old, where weather sealing is likely to be poor, could explain the significance of wind direction and speed.

Building size, unsurprisingly, makes an appearance here. Cloud coverage affects solar loading, so its inclusion is also understandable.

The time of day and day of week round out the top ten, showcasing the importance of feature engineering.

The list of features here is by no means exhaustive, but it provides an indication of the biggest levers that can be pulled to reduce energy consumption in buildings. Some can be pulled in real-time via predictive analytics, to take advantage of weather forecasts and plan energy usage for the next 24-48 hours. Some are fundamental to construction and must be determined at that time.

Conclusion and Future Work

This study investigated an open-source dataset relating to building energy usage. It involved data wrangling to correctly merge three datasets, imputation of missing values, and the engineering of features thought to be relevant.

A gradient-boosted tree algorithm was trained on an appropriate fraction of the resulting dataset using early stopping on a validation set. 90% of predictions from the holdout test set were found to be within 20% error.

The trained algorithm was used to compute feature importance. Some surprising inclusions in the top ten most important features were found, namely building ID and wind speed and direction. Concrete explanations for the importance of these features require more information on building properties and are beyond the scope of this study.

Possible future work could involve the inclusion of more building metadata to determine the components of building ID that affect energy usage. Another interesting avenue to explore is the separation of buildings by site ID, floor size, decade of build, or primary use. This would allow the determination of important features for each group of buildings, which may differ from each other.

The results from this study demonstrate the applicability of AI to building energy modelling. Possible uses are for real-time control of heating, cooling and lighting systems, or determination of important factors to consider when plans for construction are being laid out.
