31 December 2020 Temporal Aspects in Air Quality Modeling—A Case Study in Wrocław
Joanna Kamińska, Estrella Lucena-Sánchez, Guido Sciavicco

Anthropogenic environmental pollution is a known and indisputable issue, and the importance of searching for reliable mathematical models that help in understanding the underlying process is witnessed by the extensive literature on the topic. In this article, we focus on the temporal aspects of the processes that govern the concentration of pollutants using typical explanatory variables, such as meteorological values and traffic flows. We develop a novel technique based on multiobjective optimization and linear regression to find optimal delays for each variable, and then we apply such delays to our data to evaluate the improvement that can be obtained with respect to learning an explanatory model with standard techniques. We found that optimizing delays can, in some cases, improve the accuracy of the final model by up to 15%.


Anthropogenic environmental pollution is a known and indisputable issue. Every day, the human body is exposed to harmful substances in various ways, including by ingestion of food and drink and by absorption with breathing. While it is possible to limit the ingestion of harmful chemicals, it is impossible to choose the air we breathe. Chemicals dangerous to health (or that become so if absorbed over certain amounts) include suspended particulate matter, nitrogen oxides, CO, and SO2. Exposure to these pollutants adversely affects human health, as confirmed by numerous scientific studies conducted in recent years around the world: Poland,1,2 Germany,3 Italy,4 USA,5-7 Canada,8 Australia,9 and Chile,10 among many others. While the quality of air is usually regularly monitored, actions leading to the reduction of air pollution are most definitely a growing need. Generally, anthropogenic sources of air pollution are known and, due to the development of civilization, impossible to eliminate completely. Therefore, scientific studies are conducted to determine the impact of factors that modify their concentrations via transformation, retention, or evacuation, and, to this purpose, mathematical models are an essential tool. Recognizing the factors that have the greatest impact on the concentration of pollutants in the air at a certain moment gives the opportunity to build plans for their manipulation, when possible, or, at least, to improve the design of cities, streets, crossroads, and houses, to ensure the fastest possible evacuation of contaminants, shorten the time of exposure to their harmful effects, and reduce the intensity of their action. In general, such models are known as land use regression (LUR) models (see previous studies,11,12 among many others). However, in addition to the identification of the factors themselves, the timing of their effect also plays a significant role.
For example, a momentary gust of wind results in a much smaller evacuation of pollutants than a wind with the same speed persisting for several hours. Therefore, it is important to identify not only the factors in play, but also the moment in time in which their influence on the concentrations of pollution is the greatest, to obtain a temporal land use regression model (TLUR).

Predicting the value of a time series, such as the concentration of a certain pollutant in time, is a very well-known problem. The usual statistical approach to its solution requires the use of moving average models, or, more generally, of Autoregressive Integrated Moving Average (ARIMA) models, which define the problem as an autoregressive one (see Box et al13 for a comprehensive introduction). The underlying idea is that the past values of the predicted variable, for example, the pollution concentration, can be used to predict future values. The implicit assumption is that the behavior of the predicted variable remains somewhat constant, and such models are designed to identify, and describe, that behavior. A different approach, more typical of machine learning, consists of enhancing a static model learner (eg, linear regression, tree regression, neural network) with lagged variables, that is, variables with the past values of the predictors. A classical regression model is designed to extract an (implicit or explicit) function of the (current) predictors that explains the (current) value of the interesting variable; a lagged regression model acts in the same way, but uses, as predictors, both the current and the past values of the explanatory variables. The so-called ARIMAX models mix these 2 ideas, and are designed to extract functions of both the current and past values of both predicting and predicted variables. The most relevant drawback of ARIMA-type frameworks is that the resulting models are, in a way, implicit. As a matter of fact, in most natural phenomena, the variable to be predicted has sufficiently slow-changing behavior for its past values to be very good predictors, so good that the real predictors cease to have relevant roles. As a consequence, with ARIMA-type methods, one obtains very good predicting models that offer no explanation of the underlying process and that, therefore, cannot serve as a basis for (T)LURs.
Pure lagged models, such as those offered by typical learning packages (see Hall,14 for example), bypass this problem when the past values of the predicted variable are not used, but raise other issues. In particular, a typical lagged model (whether explicit or implicit) uses tens of lagged variables for each explanatory parameter; this proliferation of columns makes it very difficult, or impossible, to interpret the underlying process even when explicit functions are used. As a consequence, the most successful lagged models are based on neural networks, which are already implicit. The form of the regressed function is, per se, an additional problem of classical approaches. ARIMA-type models and linear regression are designed to extract a linear function, which is, typically, a good first approximation of the underlying behavior. As we have recalled, lagged models can also be tree-like, that is, layered functions, or even highly nonlinear, such as in neural networks. But these approaches do not allow one to explicitly model the nonlinearity level, so to speak, in search of some simple, albeit nonlinear, behavior.15
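The column proliferation described above can be illustrated with a short Python sketch (the helper and the toy data are our own, not code from any of the cited packages): a single predictor with maximum lag p already yields p+1 columns, so a handful of predictors quickly produces tens of lagged variables.

```python
def make_lagged(rows, max_lag):
    """Turn a univariate predictor series into lagged columns.

    rows: list of (a, b) pairs, ordered by time, where a is the
    predictor and b the response.  Returns one row
    [a(t), a(t-1), ..., a(t-max_lag), b(t)] per usable time t.
    """
    a = [r[0] for r in rows]
    b = [r[1] for r in rows]
    out = []
    for t in range(max_lag, len(rows)):
        # one column per lag: the model sees the present and the past
        out.append([a[t - l] for l in range(max_lag + 1)] + [b[t]])
    return out

# five hourly observations of one (hypothetical) predictor and response
data = [(10, 1.0), (12, 1.2), (9, 1.1), (14, 1.5), (11, 1.3)]
lagged = make_lagged(data, max_lag=2)
# each row: [A(t), A(t-1), A(t-2), B(t)]; the first max_lag rows are lost
```

Note that with, say, 6 predictors and a lag of 10, the same construction would already produce 66 columns, which is the interpretability problem the text refers to.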

In this article, we present a methodology to select, at the same time: (1) predicting variables, (2) the amount of their lag, and, if necessary, (3) their nonlinear contribution. By using a multiobjective optimization algorithm, we produce a set of potential solutions. In some sense, each solution can be thought of as a temporal convolution vector that highlights the contribution of each predictor by taking into account the temporal component and their nonlinear contribution. Any such vector can be applied to the original data, to test the effect of the transformation with different regression algorithms. A temporal convolution vector contains an optimal lag and an optimal nonlinear transformation for each variable, and it not only allows the induction of a better explanation model, but it is also interpretable per se, as it shows exactly the delay after which each predictor is most influential. Optimizing lags solves, in a way, both problems of ARIMA-type models and lagged models: the temporal convolution vector allows one to better understand the necessary delay for an explanatory variable to take effect, and, if it is the case, its nonlinear contribution.

To test our methodology, we considered a data set containing NO2 and NOx concentrations measured hourly from 2015 to 2017 by a monitoring station located in Wrocław, Poland, along with a set of meteorological and vehicle traffic data. As we shall see, applying a temporal convolution vector results in a sensible improvement in the performance of the learned model, showing that, in fact, delays and nonlinear contributions can be taken into account without losing the interpretability of the model. We tested our methodology with 3 types of learners: linear regression, random forest, and multilayer perceptron, and in all 3 cases, we found an improvement in the performance of the learned model.


Mathematical models for air quality prediction and explanation

The relationships between concentrations of air pollutants in the urban agglomeration and meteorological factors, traffic flow, and other elements of the environment have been described using many different modeling techniques. However, taking into account the interpretative usefulness of the results obtained, they can be categorized into (1) interpretable models, including explicit functions (linear, polynomial, exponential, and other nonlinear functions),16-19 clustered models,20 and probabilistic models,21-23 and (2) noninterpretable models, including those based on neural networks,24 forests of decision trees or regression trees,25-27 and ensemble models.28,29

The relationships between concentrations of air pollutants, traffic flow, and meteorological conditions based on the same hourly data used in this article have already been analyzed. An interpretable simple probabilistic model was built for NO2 concentrations based on traffic flow and wind speed in 2015 to 2017.30 Data from 2015 to 2016 (not including solar radiation) were used to create models based on a random forest for NO2, NOx, and PM2.5.21 A total of 9 models were built, each for a different subset of data (warm and cold season, working/nonworking days, and all data for each pollutant). The best results obtained were R2=0.57 and 0.52 for NO2 and NOx, respectively. In Kamińska,25 the author presents a modified model for NO2 concentrations also based on the concept of random forest using data from 2015 to 2017, obtaining a prediction of NO2 daily concentration with a determination coefficient R2=0.82, by means of which it was possible to determine the importance of each feature for low and high pollution concentrations separately. However, this result was obtained by a set of models, not a single one; the result for the single model on all data was R2=0.60. In all cases, past values of the predictors were not taken into consideration.

Time series and lagged models

A time series is a series of data points labeled with a temporal stamp. Time series arise in multiple contexts; for example, in our case, data from environmental monitoring stations can be seen as time series, in which atmospheric values (eg, pressure, concentration of chemicals) change over time. If each data point contains a single time-dependent value, then the time series is univariate; otherwise, it is called multivariate. In our case, we refer to a multivariate time series in which precisely 1 variable is dependent; for other applications, it makes sense to consider multivariate time series with multiple dependent variables. There are 2 main problems associated with time series: time series explanation and time series forecasting; these problems are usually associated with different contexts and approached with different tools. In particular, time series forecasting emerged in the realm of statistical economics, and, more recently, has found applications in other contexts. Time series explanation, on the other hand, is related to machine learning processes, and it is not linked to a particular field of application. The different approaches to time series analysis have, clearly, a nonzero overlap.

The typical statistics-based time series forecasting approach is based on autoregressive models. The simplest univariate forecasting approach is commonly known as the simple moving average (SMA) model: an SMA is calculated over the time series by considering its last n values, used to perform a smoothing process of the series, and then used to forecast the next value. Although such an approach has some clear limitations, it is still useful to establish a baseline against which to compare more complex solutions.13 Based on the observation that the most recent values may be more indicative of a future trend than older ones, simple exponential smoothing models consider a weighted average over the last n observations, assigning exponentially decreasing weights as they get older.13 Other than this first, simple type of smoothing, it is also worth mentioning Holt Exponential Smoothing models31 and Holt-Winters Exponential Smoothing32 models. Technically, exponential smoothing belongs to the broader ARIMA family.33 Their multivariate counterpart, the ARIMAX models, allows one to study the interaction between independent time series and dependent ones, in a similar way as lagged machine learning models do.
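The two baseline schemes just described can be sketched in a few lines of Python; the function names and the toy series are our own illustration, not code from the cited literature:

```python
def sma_forecast(series, n):
    """Simple moving average: forecast the next value as the
    arithmetic mean of the last n observations."""
    window = series[-n:]
    return sum(window) / len(window)

def ses_forecast(series, alpha):
    """Simple exponential smoothing: fold the series into a single
    level, weighting newer observations more heavily (the weight of
    an observation decays by a factor of 1 - alpha per step)."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# hypothetical hourly concentration values
obs = [50.0, 52.0, 49.0, 55.0, 53.0]
sma_next = sma_forecast(obs, 3)   # mean of the last 3 values
ses_next = ses_forecast(obs, 0.5)
```

With alpha close to 1, exponential smoothing tracks the most recent observation; with alpha close to 0, it approaches a long-run average, which is the trade-off the text alludes to.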

In the machine learning context, there are 2 influential approaches to time series analysis: recurrent neural networks34-36 and lagged models. In the former case, neural networks are adapted to the specific form of a time series to be trained for forecasting. In the latter case, on the other hand, data are systematically transformed by adding a delay, so that a classical, propositional learning algorithm can then be applied; among the available packages for this purpose, we mention Weka timeseriesForecasting.14 Lagged models are flexible by nature, as they are not linked to a specific learning schema, and their focus is on time series explanation. In some cases, lagged variables have been used for neural network training, increasing their performance. While explicit models can be used for forecasting, it is not their focus, considering that their forecasting horizons are limited to the maximum lag in the model; time series forecasting models, on the other hand, allow long-term predictions, at the expense of the interpretability of the models. A different, yet related, approach is presented in previous work,37 in which a modified decision tree learner has been used to model air pollution using lagged versions of the predicting variable in the form of univariate time series. Lagged models, and, more generally, models that take into consideration the past values of the predictors, have mainly been used in dealing with issues related to the analysis of factors affecting health and human life. In studies on the impact of air pollution on mortality, lagged variables are used to account for the duration of the exposure. The effect of exposure to high concentrations of particulate matter has been studied in previous studies,8,10,38 among others.
The effect of exposure to ozone, including lagged variables, was analyzed, inter alia, in previous studies.39,40 As for pollution concentration models, lagged variables are considered in forecasting models such as Catalano et al,41 but such models are definitely different from those developed in this work from many perspectives. Similarly, multiobjective optimization processes are used to solve various types of optimization problems in the environmental context as well, for example, in the allocation of sediment resources,42 long-term ground water monitoring systems,43 the calibration of rainfall-runoff models,44 the optimal location and size of a given number of check dams,45 or the problem of mitigating climate effects on water resources.46

Feature selection

Feature selection (FS) is defined as the process of eliminating from the data set those features that are irrelevant to the task to be performed.47 Feature selection facilitates data understanding, reduces the storage requirements, and lowers the processing time, so that model learning becomes an easier process. Feature selection methods that do not incorporate dependencies between attributes are called univariate methods; they consist in applying some criterion to each feature-response pair, and measuring the individual power of a given feature with respect to the response independently of the other features, so that each feature can be ranked accordingly. In multivariate FS, on the other hand, the assessment is performed on subsets of features rather than single features. In our case, multivariate FS may be paired with optimal lag searching, so that not only the delayed effect of each variable but also its actual need in the explanation model is taken into account. Relevant to this study are FS techniques based on multiobjective problems, reviewed in the next paragraph.
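A minimal sketch of univariate FS as just described, ranking each feature independently by its absolute Pearson correlation with the response (the feature names and data are hypothetical, and correlation is just one of the possible per-feature criteria):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(columns, response):
    """Univariate FS: score each feature against the response in
    isolation and return the names ranked by |correlation|."""
    scores = {name: abs(pearson(col, response)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

features = {"wind": [1, 2, 3, 4], "noise": [5, 1, 4, 2]}
ranking = rank_features(features, [2, 4, 6, 8])
```

Because each feature is scored in isolation, this scheme cannot detect features that are useful only in combination, which is exactly the gap multivariate FS addresses.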

Multiobjective optimization

A multiobjective optimization problem (see Collette and Siarry48) can be formally defined as the optimization problem of simultaneously minimizing (or maximizing) a set of k arbitrary functions:

  min f(x) = (f1(x), f2(x), …, fk(x))    (1)

where x = (x1, …, xn) is a vector of decision variables. A multiobjective optimization problem can be continuous, in which we look for real values, or combinatorial, in which we look for objects from a countably (in)finite set, typically integers, permutations, or graphs. Maximization and minimization problems can be reduced to each other, so that it is sufficient to consider 1 type only. A solution x dominates a solution x′ if and only if x is better than x′ in at least 1 objective, and it is not worse than x′ in the remaining objectives. We say that x is nondominated if and only if there is no other solution that dominates it. The set of nondominated solutions of a problem is called the Pareto front. In general, finding the Pareto optimal front of a multiobjective problem is a computationally hard task. Optimization problems are therefore usually approximated. Among the popular approximation techniques, multiobjective evolutionary algorithms are a typical choice.49-52
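The dominance relation and the Pareto front just defined translate directly into code; the following sketch assumes minimization in every objective and represents each candidate solution by its vector of objective values:

```python
def dominates(f1, f2):
    """f1 dominates f2 (minimization): f1 is no worse in every
    objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))

def pareto_front(points):
    """Return the nondominated subset of a list of objective vectors
    (quadratic scan; fine for the small populations of an EA)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(pts)  # (3, 4) and (5, 5) are dominated
```

In a multiobjective evolutionary algorithm such as NSGA-II, this nondomination test is the basis for ranking the population into fronts.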

Feature selection can be seen as a multiobjective optimization problem, in which the solution encodes the selected features, and the objective(s) are designed to maximize the performance of some classification/regression algorithm in several possible ways; this may entail, for example, instantiating equation (1) as:

  max f1(x) = P(x),  min f2(x) = N(x)    (2)

where x represents the chosen features, P(x) is the performance of a predetermined classification/regression algorithm on those features, which we maximize, and N(x) is their number, which we minimize. The use of evolutionary algorithms for the selection of features in the design of automatic pattern classifiers was introduced in Siedlecki and Sklansky.53 A review of evolutionary techniques for FS can be found in Jiménez et al,50 and a very recent survey of multiobjective algorithms for data mining in general can be found in Mukhopadhyay et al.52 The first evolutionary approach involving multiobjective optimization for FS was proposed in Ishibuchi and Nakashima,54 and a formulation of FS as a multiobjective optimization problem has been presented in Emmanouilidis et al.51 The wrapper approach proposed in Liu and Iba55 takes into account the misclassification rate of the classifier, the difference in error rate among classes, and the size of the subset using a multiobjective evolutionary algorithm where a niche-based fitness punishing technique is proposed to preserve the diversity of the population, while the one proposed in Pappa et al56 minimizes both the error rate and the size of a decision tree. Another wrapper method is proposed in Shi et al,57 while in García-Nieto et al,58 2 wrapper methods with 3 and 2 objectives, respectively, applied to cancer diagnosis are compared. Finally, very recent examples of multiobjective FS systems can be found in Jiménez et al.59,60 To the best of our knowledge, the only attempt to use a multiobjective optimization process in modeling air quality is for monitoring system planning in Sarigiannis and Saisana.61


There is only 1 communication (traffic) station for measuring air quality in the city of Wrocław, and it is located on a wide street with 2 lanes in each direction (GPS coordinates: 51.086390 North, 17.012076 East, see Figure 1). The center of one of the largest intersections in the city, with 14 traffic lanes, is located approximately 30 m from the measuring station, and is covered by traffic monitoring. The measurement station is located on the outskirts of the city, 9.6 km from the airport (the distance between the weather monitoring station and the air quality station is one of the reasons, although not the only one, for which time lags play a role in our model). Pollution data are collected by the Provincial Environment Protection Inspectorate and encompass the hourly NO2 and NOx concentration values during the full 3 years, from 2015 to 2017. The traffic data are provided by the Traffic Public Transport Management Department of the Roads and City Maintenance Board in Wrocław, and include hourly counts of all types of vehicles passing the intersection. Public meteorological data are provided by the Institute of Meteorology and Water Management, and they include air temperature, solar duration, wind speed, relative humidity, and air pressure. The full data set contains 26 304 observations. In the preprocessing phase, each missing value (there were 617 samples, that is, 2.3%, with at least 1 missing value) has been interpolated with the 2 closest known values, immediately before and immediately after the missing one. Some basic statistical indicators on the remaining 25 687 instances are presented in Table 1, along with the symbol used in the tests for each variable.
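The missing-value treatment described above can be sketched as follows, assuming that averaging the known values immediately before and after the gap is the intended interpolation (the series is invented toy data, and `None` marks a missing reading):

```python
def fill_missing(series):
    """Replace each None with the mean of the closest known values
    before and after it (a simple reading of the interpolation
    described in the text; gaps at the boundaries are not handled)."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            prev = next(out[j] for j in range(i - 1, -1, -1) if out[j] is not None)
            nxt = next(out[j] for j in range(i + 1, len(out)) if out[j] is not None)
            out[i] = (prev + nxt) / 2
    return out

# hypothetical hourly NO2 readings with two missing values
hourly_no2 = [30.0, None, 34.0, 35.0, None, 39.0]
filled = fill_missing(hourly_no2)
```

With only 2.3% of the samples affected, such a local interpolation leaves the statistics of Table 1 essentially unchanged.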

Figure 1.

Aerial view of the monitoring station.


Table 1.

Descriptive statistics.



Lagged regression

Linear regression is the most immediate approach to explicit modeling of a multivariate time series. In our case, for example, one can denote by:

  A = {A1(t), …, An(t), B(t)},  t = 1, …, m    (3)

the set of preprocessed data, in which A1,…,An are the independent columns (air temperature, solar duration, wind speed, relative humidity, air pressure, and traffic flow), B is the dependent one (NO2 or NOx, depending on the problem we want to solve), ordered by the time of observation, and use a linear regression algorithm to extract a function of the type:

  B(t) = c0 + c1A1(t) + … + cnAn(t)    (4)

Solving this problem entails finding n+1 optimal parameters (or coefficients) c0,c1,…,cn to fit the above equation, which does not take into account past values of any independent parameter; the temporal component remains implicit. Lagged (linear) regression consists of solving a more general equation, formulated as:

  B(t) = c0 + Σ(i=1…n) Σ(l=0…pi) ci,l Ai(t−l)    (5)

In other words, we use the value of each independent variable Ai not only at time t but also at time t−1,t−2,…,t−pi, to explain B at time t; each Ai(t−l) is associated with a coefficient ci,l, which must be estimated, along with each maximum lag pi. There are available techniques, based on standard regression algorithms, that allow one to solve the inverse problem associated with equation (5); unfortunately, the resulting equation may be very difficult to interpret. Equation (5) is very similar to an ARIMAX model, but without the autoregressive part.
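As a minimal worked illustration of lagged regression, the sketch below regresses B(t) on a single predictor at a fixed lag, using the closed-form ordinary least squares solution for one variable (the data and helper names are our own; the full problem of equation (5) simply repeats this alignment for every variable and every lag):

```python
def fit_simple(xs, ys):
    """Closed-form ordinary least squares for y = c0 + c1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    c0 = my - c1 * mx
    return c0, c1

def fit_lagged(a, b, lag):
    """Regress B(t) on a single lagged predictor A(t - lag):
    pair A(t - lag) with B(t) for every t >= lag, then fit OLS."""
    xs = a[: len(a) - lag] if lag else a[:]
    ys = b[lag:]
    return fit_simple(xs, ys)

# toy series constructed so that B(t) = 1 + 2 * A(t - 1) exactly
a = [1, 2, 3, 4, 5]
b = [0, 3, 5, 7, 9]
c0, c1 = fit_lagged(a, b, lag=1)  # recovers c0 = 1, c1 = 2
```

The alignment step (dropping the first `lag` responses and the last `lag` predictor values) is exactly what makes the transformed data set shorter, as discussed below for the general transformation.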

Lagged regression can be performed with other, more complicated, model extraction methods that may not allow explicit representation, such as random forest or neural networks of any type. The underlying idea is the same: enhance the independent data by adding lagged versions of each variable Ai up to a predetermined maximum pi.

Optimizing a lagged model

Both standard and lagged linear regression are classical, simple approaches to the problem of explaining the value of B(t) (in our case, pollution concentrations), using current and past values of A1,…,An: having learned the best coefficients, the performance of the model is computed using standard measures. As per classical, standard data analysis procedures, the set A may be used in k-fold cross-validation mode to extract a linear model and test it at the same time, or separated into training and test subsets. In any case, solving equation (5) entails fixing the value pi for each independent variable, and each past value of each independent variable contributes to the model in the same way.

We work under the additional assumption that, for each i, there is precisely 1 lag li, such that Ai(t−li) influences the output more than any other lag; this may be reasonable in some applications, and less so in others: as we shall see, it fits our case perfectly. Under such an assumption, the model that we are assuming becomes:

  B(t) = c0 + c1A1(t−l1) + … + cnAn(t−ln)    (6)

Our methodology can be described as follows: (1) we split the original data set A into 2 sets that we call Atr and Amo, respecting the temporal ordering, (2) we optimize the values of c0,c1,…,cn, and l1,…,ln on Atr, obtaining a linear model that fits Atr well, and (3) we use the obtained lags to transform Amo, so that a new model can be learned. Splitting into Atr and Amo is necessary to ensure that finding the optimal lags is computationally affordable: we search for the optimal lags in a small data set (Atr), and then we apply them to a bigger one (Amo), so that a model can be learned on the latter. The model that is learned on the transformed data can be of any type: linear (or, in general, functional), tree-based, or a neural network. The transformation gives us already some information on the problem itself, as it has optimized the delay after which a certain independent variable has effect; such a learned model, obtained on Amo, can be tested using classical testing modes, such as training + test (which would entail further separating Amo into 2 sets), or, as we shall do, k-fold cross-validation. Observe that Atr cannot be considered a training set per se, but, instead, a pretraining set, used with the sole purpose of performing the optimization.
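Step (1) of the methodology, the temporally ordered split into Atr and Amo, can be sketched as follows; the 30/70 proportion matches the experiments reported later, but the helper itself is our own illustration:

```python
def temporal_split(rows, pretrain_frac=0.3):
    """Split a time-ordered data set into a pretraining part (used
    only for lag/exponent optimization) and a modeling part,
    preserving temporal order: no shuffling, the cut is a single
    point in time."""
    cut = int(len(rows) * pretrain_frac)
    return rows[:cut], rows[cut:]

# with 10 hourly observations, the first 3 go to Atr, the rest to Amo
a_tr, a_mo = temporal_split(list(range(10)), pretrain_frac=0.3)
```

Keeping the cut temporal (rather than random) is deliberate: shuffling would leak future observations into the pretraining set and bias the lag optimization.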

Having separated the optimization part from the training part, we can now arbitrarily complicate the model. For example, in some contexts, such as pollution concentration modeling, a super-linear explanation model may fit better than a linear one, while preserving the possibility of an intuitive interpretation. From the mathematical point of view, the inverse problem that corresponds to searching for a super-linear model is a simple generalization of equation (6):

  B(t) = c0 + c1A1(t−l1)^e1 + … + cnAn(t−ln)^en    (7)

Thus, in the optimization part of our methodology, we can optimize coefficients ci, lags li, and exponents ei, for each predictor, and, then, apply this more elaborate transformation to the data. Once again, as a result of the optimization phase, we obtain new information; for example, we may learn that wind influences the amount of nitrogen oxide in the air at a certain moment with 2 hours of delay, and in a way proportional to the square of its strength. Even if we choose, in the second phase, to use a noninterpretable learning model (eg, random forest), we have more information on the underlying process than we would have had using the same learning model without transformation. Our purpose is to prove that such an optimization does improve the performance of the model learned in the second phase, independently of the specific learning model. A representation of this methodology can be found in Figure 2.

Figure 2.

A simple schematic of the proposed methodology.


Solving the optimization problem

Deciding the best lag and the best exponent for each variable is an optimization problem. Formally, given a multivariate time series A1(t),…,An(t),B(t) with m distinct observations (such as our data on atmospheric pollution), let us define P as the maximum lag of the problem (ie, we do not search for solutions with lags greater than P; observe that in equation (5), we have pi=P for each i) and E as the maximum exponent of the problem (ie, we do not search for solutions with exponents greater than E; observe that in equation (7), we have ei⩽E for each i). Let us fix a vector x = (x1,…,xn) of decision variables with domain [−1,…,P], and let M be the maximum of x (called the maximum actual lag of x). The vector x entails a transformation of equation (3) into a new data set with m−M observations, in which the feature (time series) Ai is lagged (ie, delayed) by the amount xi:

  A(x) = {A1(t−x1), …, An(t−xn), B(t)},  t = M+1, …, m    (8)

The transformation equation (8) works for every xi ⩾ 0. The case xi = −1 is interpreted as excluding the column Ai from the problem (entailing an implicit FS method). Similarly, let y = (y1,…,yn) be a second vector of decision variables with domain [1,…,E]. For each variable Ai, we interpret yi as the exponent to which Ai is raised in equation (7). So, in conjunction, the pair (x, y) entails a transformation of equation (3) into a new data set with m−M observations, in which the feature (time series) Ai is lagged (ie, delayed) by the amount xi, and raised to the power of yi:

  A(x, y) = {A1(t−x1)^y1, …, An(t−xn)^yn, B(t)},  t = M+1, …, m    (9)

A simple numeric example of such a transformation can be seen in Figure 3. In this example, we start off with 5 observations and 5 independent variables (the sixth column indicates the time of observation); the left-hand side (that contains the lags) of the decision variables vector contains 0 for the first and the fourth variable, 1 for the second variable, 3 for the fifth variable, and −1 for the third variable (which entails that this variable is not selected); the right-hand side (that contains the exponents) contains 1 (linear behavior) for every variable except the third one (which, by the way, is eliminated) and the fifth one, in which the exponent is 2 (quadratic behavior). The transformed data set, therefore, contains 3 fewer observations because in this example, we are assuming that to explain the current value of B, we need to look at the value of A5 3 units of time before. The values of the selected variables, at the selected times, are then raised to the chosen exponent, instantiating equation (9).
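The transformation of Figure 3 can be reproduced programmatically. The sketch below (our own illustration) assumes each observation is a plain Python list of the n variable values followed by the value of B, and applies the same decision vectors as the example: lags (0, 1, −1, 0, 3) and exponents (1, 1, 1, 1, 2):

```python
def transform(data, x, y):
    """Apply the pair of decision vectors (x, y) to a time-ordered
    multivariate series.

    data: list of rows [A1(t), ..., An(t), B(t)].
    x[i] is the lag of variable i (-1 drops the variable);
    y[i] is its exponent.  Returns the rows
    [..., Ai(t - x[i]) ** y[i], ..., B(t)] for t >= max actual lag,
    so the result has max(x) fewer observations.
    """
    n = len(x)
    M = max(x)  # maximum actual lag
    out = []
    for t in range(M, len(data)):
        row = [data[t - x[i]][i] ** y[i] for i in range(n) if x[i] >= 0]
        row.append(data[t][-1])  # B(t) is never lagged
        out.append(row)
    return out

# 5 observations, 5 variables plus B, as in Figure 3
data = [[1, 2, 3, 4, 5, 10],
        [6, 7, 8, 9, 10, 20],
        [11, 12, 13, 14, 15, 30],
        [16, 17, 18, 19, 20, 40],
        [21, 22, 23, 24, 25, 50]]
lagged = transform(data, x=[0, 1, -1, 0, 3], y=[1, 1, 1, 1, 2])
# 2 rows survive: one variable dropped, A5 squared and lagged by 3
```

As in the figure, the third variable disappears and the data set loses as many rows as the maximum chosen lag.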

Figure 3.

Example of transformation. The original data set (top) is transformed by the pair (x, y) (middle) into a lagged data set (bottom), with one fewer feature. There are 3 fewer instances due to the maximum chosen lag.


Solving the optimization problem entails evaluating a vector (x, y) on Atr. As per our methodology, during the optimization phase, we use an efficient learner, such as linear regression, and any standard measure of the performance of the extracted model, such as the Pearson correlation between the function values and the actual values of B; we denote such a generic function by ρ(x, y), that is, the correlation coefficient obtained by a linear regression algorithm run on Atr after the transformation entailed by (x, y); the correlation coefficient should be maximized. In a multiobjective context, however, we can also optimize the characteristics of both x and y. On one hand, we want to minimize the number of selected variables, using:

  N(x) = |{i : xi ⩾ 0}|    (10)

and, on the other hand, we want to minimize the complexity of the extracted model, using:

  C(x, y) = Σ{i : xi ⩾ 0} yi    (11)

Thus, solving our optimization problem means instantiating equation (1) with:

  max f1(x, y) = ρ(x, y),  min f2(x, y) = N(x),  min f3(x, y) = C(x, y)    (12)

Clearly, other objective functions can be designed that may or may not improve the quality of the solutions.
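The 2 structural objectives can be sketched directly from their descriptions; note that summing the exponents of the selected variables is one plausible reading of the complexity objective, stated here as an assumption rather than as the paper's exact formula:

```python
def n_selected(x):
    """Number of selected variables: those whose lag is not -1."""
    return sum(1 for xi in x if xi >= 0)

def complexity(x, y):
    """A simple model-complexity measure (our assumption): the sum
    of the exponents of the selected variables, so that an all-linear
    model of k variables has complexity k."""
    return sum(yi for xi, yi in zip(x, y) if xi >= 0)

# the decision vectors of the Figure 3 example
x = [0, 1, -1, 0, 3]
y = [1, 1, 1, 1, 2]
counts = (n_selected(x), complexity(x, y))  # 4 variables, complexity 5
```

Together with the correlation coefficient of the model fitted on the transformed Atr (maximized), these two values (both minimized) form the objective vector handed to the multiobjective optimizer.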


Multiobjective evolutionary algorithms are known to be particularly suitable to perform multiobjective optimization, as they search for multiple optimal solutions in parallel. In this experiment, we have chosen the well-known Nondominated Sorting Genetic Algorithm II,49 which is available as open source from the suite jMetal.62 Nondominated Sorting Genetic Algorithm II is an elitist Pareto-based multiobjective evolutionary algorithm that employs a strategy with a binary tournament selection and a rank-crowding better function, where the rank of an individual in a population is the nondomination level of the individual in the whole population. As the black-box linear regression algorithm, during the optimization phase, we used the class LinearRegression from the open-source learning suite Weka,63 run in 10-fold cross-validation mode, with standard parameters and no embedded FS. We have represented each individual solution (gene) as an array:


with values in [−1,…,L] for the inline-formula01-1178622120975829.gif side and in [1,…,E] for the inline-formula02-1178622120975829.gif side.

The initial population has been generated randomly, in each execution, and the fitness functions reflect the objective functions of equation (12) in the obvious way; in particular, Weka implementation of any regression extraction algorithm returns, among other measures, the Pearson correlation coefficient of the extracted model. Mutation and crossover operations have been adapted to the form of the gene. To mutate an individual I, we proceed as follows: (1) we randomly choose between the x-part and y-part; (2) we randomly choose the index to mutate; and (3) we randomly choose to mutate the corresponding value with an increment (plus 1), decrement (minus 1), or random substitution. Increments, decrements, and random substitutions are implemented in the interval [−1,…,L] in the x-part, and in the interval [0,…,E] in the y-part. Similarly, crossover between 2 individuals I,J is implemented as follows: (1) we randomly choose between the x-part and y-part; (2) we randomly choose 2 indexes in I and 2 indexes in J; and (3) we perform the crossover on the selected indexes.
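The mutation and crossover operators just described can be sketched as follows (an illustrative Python rendering, not the actual jMetal implementation; representing an individual as a dictionary with an x-part and a y-part is our own simplification).

```python
import random

L, E, N = 6, 3, 6   # max lag, max exponent, number of variables

def mutate(ind):
    """Mutate one position of the x-part or the y-part in place."""
    part = random.choice(("x", "y"))
    lo, hi = (-1, L) if part == "x" else (0, E)
    vec = ind[part]
    i = random.randrange(len(vec))
    op = random.choice(("inc", "dec", "sub"))
    if op == "inc":
        vec[i] = min(vec[i] + 1, hi)      # increment, clamped to the interval
    elif op == "dec":
        vec[i] = max(vec[i] - 1, lo)      # decrement, clamped to the interval
    else:
        vec[i] = random.randint(lo, hi)   # random substitution

def crossover(a, b):
    """Swap the values at 2 randomly chosen indexes of one part."""
    part = random.choice(("x", "y"))
    i, j = random.sample(range(N), 2)
    a[part][i], b[part][i] = b[part][i], a[part][i]
    a[part][j], b[part][j] = b[part][j], a[part][j]

ind = {"x": [0] * N, "y": [1] * N}
for _ in range(100):
    mutate(ind)
assert all(-1 <= v <= L for v in ind["x"])
assert all(0 <= v <= E for v in ind["y"])
```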


According to the methodology described in Figure 2, we prepared the data sets Atr (7706 observations, 30% of the initial data) and Amo (17 981 observations, 70% of the initial data) for 2 problems, NO2 and NOx, which we kept separated, for 2 different sets of experiments. We then ran the optimization procedure with a random initial population for 10 independent experiments, with seeds from 1 to 10. In the final population of each execution, we applied a simple decision-making strategy to choose 1 individual of the Pareto front, that is, the one with the best correlation coefficient, which is a natural choice. On each execution, we set the maximum lag at 6 hours (L=6) and the maximum exponent at 3 (E=3). As it turns out, all selected solutions contain all initial variables, suggesting that all predictors do have a nontrivial role in the problem. The training results are shown in Table 2 for NO2 and in Table 3 for NOx. Both tables are structured in the same way. For each execution, we show the best correlation coefficient that we have obtained during training. Also, we show the lags, the exponents, and the coefficient for each variable, in the original order: temperature, solar radiation, wind, humidity, pressure, and traffic.
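The decision-making step amounts to a simple argmax over the final Pareto front. A sketch, with hypothetical front members of the form (correlation, number of variables, complexity):

```python
# Hypothetical final Pareto front: (correlation, n_variables, complexity).
# The values below are illustrative, not the experiment's actual numbers.
front = [(0.89, 6, 9), (0.86, 5, 7), (0.91, 6, 12), (0.84, 4, 6)]

# Decision-making strategy: keep the individual with the best correlation.
best = max(front, key=lambda sol: sol[0])
assert best == (0.91, 6, 12)
```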

Table 2.

Training results for NO2.


Table 3.

Training results for NOX.


As for the test phase, we have operated as follows: first, we have used the original test data set (Amo), with no transformations, to run 3 different regression algorithms, and, second, we have used the same algorithms over the data transformed with each of the 10 selected solutions, to measure the difference in performance, in 10-fold cross-validation mode. The algorithms that we have used are: (1) classic linear regression (class LinearRegression in Weka, with the same parameters as in the optimization phase); (2) the random forest algorithm (class RandomForest in Weka; parameters: 100 trees, 100% size per bag, minimum 1 instance per leaf, 10^-3 maximum variance per leaf, unlimited depth per tree, no backfitting); and (3) a multilayer perceptron (class MultilayerPerceptron in Weka; parameters: 0.3 learning rate, 0.2 momentum rate for backpropagation, 500 epochs, no validation set). Each execution of each learner has been run in 10-fold cross-validation mode, and, again, the column corr. denotes the result of the Pearson correlation test, averaged over the 10 folds. We have also displayed the results of the standard residual tests: the mean absolute error, the root mean squared error, the relative absolute error, and the root relative squared error. The results of the test phase are shown in Tables 4 and 5.
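An analogous test protocol can be sketched with scikit-learn stand-ins for the Weka classes (a sketch on synthetic data; the models and parameters below are analogues of, not the exact, Weka configurations used in the experiments, and the score shown is R2 rather than the Pearson correlation reported in the tables):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the pollution data: 6 explanatory variables,
# a linear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=500)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "multilayer perceptron": MLPRegressor(max_iter=500, random_state=0),
}
for name, model in models.items():
    # Score averaged over 10 folds, mirroring the 10-fold CV protocol.
    score = cross_val_score(model, X, y, cv=10, scoring="r2").mean()
    print(f"{name}: {score:.3f}")
```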

Table 4.

Test results for NO2.


Table 5.

Test results for NOX.



In view of the results that have been obtained, 2 important elements emerge: first, how lags and nonlinear contributions are explained in the physical process, and, second, how the transformations improve the synthesis of regression models. As far as the first point is concerned, it is important to understand that different executions may give rise to different results and yet similar performances. On one side, some lags are consistent in all executions, which indicates that the temporal component of their contribution is stable and clear. On the other side, some lags and some individual coefficients do show a certain variability: we believe that, in such cases, more experiments are necessary to fully understand the underlying mechanisms. Notwithstanding this, it is clear that the convolution vectors have a clear positive effect on the cross-validated performances across the entire spectrum of regression algorithms that we have tried. The following considerations can be made.

  • Traffic influences pollutant concentrations in a positive way, with 1 hour of delay; this delay can be explained by the fact that the effect of exhaust gases needs some time to accumulate (observe, also, that we are limited by the granularity of the observations: if sensors had collected data with a granularity of, say, 1 minute, we might have seen shorter delays for this variable).

  • Higher air temperatures are associated with a decrease in NO2 and NOX concentrations, which is caused by 3 processes: (1) at higher air intake temperatures, less NO2 is produced in the process of fuel combustion in the engine; (2) higher air temperatures usually imply more favorable atmospheric conditions, which encourage residents to use alternative means of transportation, reducing traffic volume64; and (3) NO2 is dynamically transformed into NO (and back) at a higher rate at higher temperatures.

  • Wind always affects concentration in a negative way, with a constant (across different executions) delay of 2 hours. This is probably due to the distance between the intersection and the meteorological station; the average wind speed of 3 m s−1 and the roughness of the terrain explain the length of the delay.

  • Higher humidity may cause a decrease in NO2 concentration in exhaust gases65,66; we have found that such an effect takes place with 5 to 6 hours of delay.

  • Long sunshine duration and high temperatures favor the transformation of nitrogen oxides into secondary pollutants, including ozone, which implies a decrease in NOX concentration; sunlight is even more important, as it acts as a catalyst, which explains the negative coefficient and the delay for sunlight duration.

As for how the transformation influences the synthesis of good regression models, we can observe the following. First, not all models show the same improvement, but all models show some improvement. In the case of NO2, the models extracted with linear regression showed a maximum improvement of 0.948 in correlation, those with random forest of 0.544, and those with multilayer perceptron of 0.563; in the case of NOX, the improvements have been of 0.461, 0.398, and 0.415, respectively. Also, observe that there was an improvement in every independent execution, indicating that the results are stable. The fact that linear regression presented the most evident improvement is expected, given that we used the same algorithm for optimization. Second, the improvement in predicting the NO2 concentration is greater than that in predicting NOX, probably because the latter is a more complex problem. Third, linear regression and random forest show greater improvements than the multilayer perceptron; the latter, however, shows low results even with nontransformed data, possibly suggesting that the underlying problem is not nonlinear enough to require, and justify, a neural network–based approach.

Comparing our results, from the statistical point of view, with those that can be obtained by other well-known methods is quite difficult. Take NO2 prediction models, for example. Running a simple, atemporal linear regression on the same data yields a correlation coefficient that, on average, is 0.678 in test (single execution). The extracted model is:


While the coefficients are not too dissimilar from those obtained with the transformations, the reduced correlation shows that delays must be taken into account. Running full lagged models, on the other hand, leads, in some cases, to higher correlations. Unfortunately, the resulting models are impossible to interpret from the environmental point of view. As an example, the simplest full lagged model that we tried, with 6 hours of maximum delay for each variable (ie, with 36 independent variables), already leads to positive and negative coefficients for the same independent variables at different delays: the variable temperature, for instance, presents positive coefficients for lags 0, 4, and 6, and negative ones at all the other lags. A possible explanation of this phenomenon is that, by artificially increasing the number of independent variables (as is done when several lagged variables are added for each parameter), one allows regression algorithms to find artificially good solutions by using many hyperplanes; in other words, the solution tends to overfit even in the presence of several thousands of data instances. Overfitted models are less reliable explanatory models because they tend not to be associated with physical explanations.
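The blow-up of the full lagged model can be illustrated by materializing every lag of every variable as a separate column (a sketch; the lag range 1 to 6 is an assumption chosen to match the stated count of 36 predictors):

```python
import numpy as np

def full_lagged_design(A, lags):
    """One column per (variable, lag); rows are trimmed so all lags align."""
    n_obs, n_vars = A.shape
    m = max(lags)
    usable = n_obs - m
    cols = [A[m - lag : m - lag + usable, i]
            for i in range(n_vars) for lag in lags]
    return np.column_stack(cols)

A = np.arange(72.0).reshape(12, 6)           # 12 hourly observations, 6 variables
X = full_lagged_design(A, lags=range(1, 7))  # lags 1..6: 6 x 6 = 36 predictors
assert X.shape == (6, 36)
```

Each original parameter thus contributes 6 correlated predictors, which is what lets the regression fit "artificially good" but physically uninterpretable solutions.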


In this article, we have approached the problem of devising an explanation model for the NO2 and NOX concentrations observed at a monitoring station in the city of Wrocław (Poland). First, we reviewed the current literature on prediction, forecasting, and explanation models for temporal series. Then, we formulated the problem of finding an explanation model for air pollution as a lagged regression problem, and designed an optimization problem whose solutions are precisely the optimal lags for regression. Finally, we proposed realistic and physically explicable lagged regression models for both NO2 and NOX concentrations based on available data from 2015 to 2017, induced with our method. We obtained significant improvements in the coefficient of determination over nonlagged regression models, while retaining (in fact, easing) the interpretability of the resulting equations. Our technique is immediately applicable to the same problems with different data, as well as to similar problems in which the temporal component plays an essential role. Explaining time series is less common in the technical literature than forecasting them. Yet, forecasting is not focused on interpretability, often resulting in (sometimes statistically reliable) black box models that offer no explanation of the underlying phenomena. For future work, we plan to improve our design by allowing our optimization scheme to consider reasonable temporal combinations of lagged variables, while retaining the superior interpretability of lagged models over statistical forecasting.


GS: experiment conception and design, technical interpretation of results, and text writing. EL-S: implementation, testing, and text writing. JK: data provision and physical interpretation of results.



Kowalska M , Skrzypek M , Kowalski M , et al. The relationship between daily concentration of fine particulate matter in ambient air and exacerbation of respiratory diseases in Silesian agglomeration, Poland. Int J Environ Res Public Health. 2019;16:1131. Google Scholar


Holnicki P , Tainio M , Kałuszko A , et al. Burden of mortality and disease attributable to multiple air pollutants in Warsaw, Poland. Int J Environ Res Public Health. 2017;14:1359. Google Scholar


Schwartz J. Lung function and chronic exposure to air pollution: a cross-sectional analysis of NHANES II. Environ Res. 1989;50:309–321. Google Scholar


Cesaroni G , Badaloni C , Gariazzo C , et al. Long-term exposure to urban air pollution and mortality in a cohort of more than a million adults in Rome. Environ Health Perspect. 2013;121:324–331. Google Scholar


Peng RD , Dominici F , Louis TA. Model choice in time series studies of air pollution and mortality. J Royal Stat Soc Series A Stat Soc. 2006;169:179–203. Google Scholar


Schwartz J. Air pollution and blood markers of cardiovascular risk. Environ Health Perspect. 2001;109:405–409. Google Scholar


Mar TF , Norris GA , Koenig JQ , Larson TV. Associations between air pollution and mortality in Phoenix, 1995-1997. Environ Health Perspect. 2000;108:347–353. Google Scholar


Vanos JK , Cakmak S , Kalkstein LS , Yagouti A. Association of weather and air pollution interactions on daily mortality in 12 Canadian cities. Air Qual Atmos Health. 2015;8:307–320. Google Scholar


Knibbs LD , de Waterman AMC , Toelle BG , et al. The Australian child health and air pollution study: a national population-based cross-sectional study of long-term exposure to outdoor air pollution, asthma, and lung function. Environment International. 2018;120:394–403. Google Scholar


Cifuentes LA , Vega J , Köpfer K , Lave LB. Effect of the fine fraction of particulate matter versus the coarse mass and other pollutants on daily mortality in Santiago, Chile. J Air Waste Manag Assoc. 2000;50:1287–1298. Google Scholar


Kryza M , Szymanowski M , Dore AJ , et al. Application of a land use regression model for calculation of the spatial pattern of annual NOx air concentrations at national scale: a case study for Poland. Proc Environ Sci. 2011;7:98–103. Google Scholar


Aguilera I , Foraster M , Basagaña X , et al. Application of land use regression modelling to assess the spatial distribution of road traffic noise in three European cities. J Expo Sci Environ Epidemiol. 2015;25:97–105. Google Scholar


Box G , Jenkins G , Reinsel G , et al. Time Series Analysis: Forecasting and Control. London, UK: Wiley; 2016. Google Scholar


Hall M. Time series analysis and forecasting with WEKA; 2014. Google Scholar


Jollois FX , Poggi JM , Portier B. Three non-linear statistical methods for analyzing PM10 pollution in Rouen area. Case Stud Bus Indus Govern Stat. 2009;3:1–17. Google Scholar


Lv B , Cobourn WG , Bai Y. Development of nonlinear empirical models to forecast daily PM2.5 and ozone levels in three large Chinese cities. Atmosp Environ. 2016;147:209–223. Google Scholar


Singh KP , Gupta S , Kumar A , et al. Linear and nonlinear modeling approaches for urban air quality prediction. Sci Total Environ. 2012;426:244–255. Google Scholar


Cobourn WG. An enhanced PM2.5 air quality forecast model based on nonlinear regression and back-trajectory concentrations. Atmosph Environ. 2010;44:3015–3023. Google Scholar


Shi JP , Harrison RM. Regression modelling of hourly NOx and NO2 concentrations in urban air in London. Atmosp Environ. 1997;31:4081–4094. Google Scholar


Maciag P , Kasabov N , Kryszkiewicz M , et al. Air pollution prediction with clustering-based ensemble of evolving spiking neural networks and a case study for London area. Environ Model Software. 2019;118:262–280. Google Scholar


Kamińska JA. The use of random forests in modelling short-term air pollution effects based on traffic and meteorological conditions: a case study in Wrocław. J Environ Manage. 2018;217:164–174. Google Scholar


Balashov NV , Thompson AM , Young GS. Probabilistic forecasting of surface ozone with a novel statistical approach. J Appl Meteorol Climatol. 2017;56:297–316. Google Scholar


Aznarte JL. Probabilistic forecasting for extreme NO2 pollution episodes. Environ Pollut. 2017;229:321–328. Google Scholar


Cabaneros SM , Calautit JK , Hughes BR. A review of artificial neural network models for ambient air pollution prediction. Environ Model Software. 2019;119:285–304. Google Scholar


Kamińska JA. A random forest partition model for predicting NO2 concentrations from traffic flow and meteorological conditions. Sci Total Environ. 2019;651:475–483. Google Scholar


Sayegh A , Tate JE , Ropkins K. Understanding how roadside concentrations of NOx are influenced by the background levels, traffic density, and meteorological conditions using boosted regression trees. Atmosp Environ. 2016;127:163–175. Google Scholar


Singh KP , Gupta S , Rai P. Identifying pollution sources and predicting urban air quality using ensemble learning methods. Atmosp Environ. 2013;80:426–437. Google Scholar


Shang Z , Deng T , He J , et al. A novel model for hourly PM2.5 concentration prediction based on CART and EELM. Sci Total Environ. 2019;651:3043–3052. Google Scholar


Zhai B , Chen J. Development of a stacked ensemble model for forecasting and analyzing daily average PM2.5 concentrations in Beijing, China. Sci Total Environ. 2018;635:644–658. Google Scholar


Kamińska JA. Probabilistic forecasting of nitrogen dioxide concentrations at an urban road intersection. Sustainability. 2018;10:1–16. Google Scholar


Holt C. Forecasting seasonals and trends by exponentially weighted moving averages. Int J Forecast. 2004;20:5–10. Google Scholar


Winters P. Forecasting sales by exponentially weighted moving averages. Manage Sci. 1960;3:324–342. Google Scholar


Poulos L , Kvanli A , Pavur R. A comparison of the accuracy of the box-Jenkins method with that of automated forecasting methods. Int J Forecast. 1987;3:261–267. Google Scholar


Hochreiter S , Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. Google Scholar


Russo A , Lind PG , Raischel F , et al. Neural network forecast of daily pollution concentration using optimal meteorological data at synoptic and local scales. Atmosp Pollut Res. 2015;6:540–549. Google Scholar


Vlachogiannis D , Sfetsos T . Time series forecasting of hourly PM10 values: model intercomparison and the development of localized linear approaches: 85-94. Google Scholar


Brunello A , Kaminska J , Marzano E , et al. Assessing the role of temporal information in modelling short-term air pollution effects based on traffic and meteorological conditions: a case study in Wrocław. In: Proceedings of the Workshop New Trends in Databases and Information Systems, Communications in Computer and Information Science, Vol. 1064. Berlin: Springer; 2019:463–474. Google Scholar


Analitis A , Katsouyanni K , Dimakopoulou K , et al. Short-term effects of ambient particles on cardiovascular and respiratory mortality. Epidemiology. 2006;17:230–233. Google Scholar


Liu T , Li TT , Zhang YH , et al. The short-term effect of ambient ozone on mortality is modified by temperature in Guangzhou, China. Atmosp Environ. 2013;76:59–67. Google Scholar


Parodi S , Vercelli M , Garrone E , Fontana V , Izzotti A. Ozone air pollution and daily mortality in Genoa, Italy between 1993 and 1996. Public Health. 2005;119:844–850. Google Scholar


Catalano M , Galatioto F , Bell M , et al. Improving the prediction of air pollution peak episodes generated by urban transport networks. Environ Sci Policy. 2016;60:69–83. Google Scholar


Wang H , Mao W , Eriksson L. A three-dimensional Dijkstra’s algorithm for multi-objective ship voyage optimization. Ocean Eng. 2019;186. Google Scholar


Kollat JB , Reed PM. Comparing state-of-the-art evolutionary multi-objective algorithms for long-term groundwater monitoring design. Adv Water Res. 2006;29:792–807. Google Scholar


Madsen H. Automatic calibration of a conceptual rainfall-runoff model using multiple objectives. J Hydrol. 2000;235:276–288. Google Scholar


Pal D , Galelli S. A numerical framework for the multi-objective optimal design of check dam systems in erosion-prone areas. Environ Model Software. 2019;119:21–31. Google Scholar


Qiu J , Shen Z , Leng G , et al. Impacts of climate change on watershed systems and potential adaptation through BMPs in a drinking water source area. J Hydrol. 2019;573:123–135. Google Scholar


Guyon I , Elisseeff A . An introduction to variable and feature selection. J Mach Learn Res. 2003:1157–1182. Google Scholar


Collette Y , Siarry P. Multiobjective Optimization: Principles and Case Studies. Berlin; Heidelberg: Springer; 2004. Google Scholar


Deb K. Multi-Objective Optimization Using Evolutionary Algorithms. London, UK: Wiley; 2001. Google Scholar


Jiménez F , Sánchez G , García J , et al. Multi-objective evolutionary feature selection for online sales forecasting. Neurocomput. 2017;234:75–92. Google Scholar


Emmanouilidis C , Hunter A , Macintyre J , et al. A multi-objective genetic algorithm approach to feature selection in neural and fuzzy modeling. Evolut Optim. 2001;3:1–26. Google Scholar


Mukhopadhyay A , Maulik U , Bandyopadhyay S , et al. A survey of multiobjective evolutionary algorithms for data mining: part I. IEEE Trans Evolut Comput. 2014;18:4–19. Google Scholar


Siedlecki W , Sklansky J . A note on genetic algorithms for large-scale feature selection. In: Chen C (ed.) Handbook of Pattern Recognition and Computer Vision. Singapore: World Scientific; 1993:88–107. Google Scholar


Ishibuchi H , Nakashima T . Multi-objective pattern and feature selection by a genetic algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference; 2000:1069–1076. Scholar


Liu J , Iba H . Selecting informative genes using a multiobjective evolutionary algorithm. In: Proceedings of the 2002 Congress on Evolutionary Computation, Vol. 1. New York, NY: IEEE; 2002:297–302. Google Scholar


Pappa GL , Freitas AA , Kaestner C . Attribute selection with a multi-objective genetic algorithm. In: Proceedings of the 16th Brazilian Symposium on Artificial Intelligence. New York, NY: IEEE; 2004:280–290. Google Scholar


Shi S , Suganthan P , Deb K . Multiclass protein fold recognition using multiobjective evolutionary algorithms. In: Proceedings of the 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology. New York, NY: IEEE; 2004:61–66. Google Scholar


García-Nieto J , Alba E , Jourdan L , et al. Sensitivity and specificity based multiobjective approach for feature selection: application to cancer diagnosis. Inform Process Lett. 2009;109:887–896. Google Scholar


Jiménez F , Jódar R , Martín M , et al. Unsupervised feature selection for interpretable classification in behavioral assessment of children. Expert Syst. 2017;34:1–15. Google Scholar


Jiménez F , Martínez C , Marzano E , et al. Multiobjective evolutionary feature selection for fuzzy classification. IEEE Trans Fuzzy Syst. 2019;27:1085–1099. Google Scholar


Sarigiannis DA , Saisana M. Multi-objective optimization of air quality monitoring. Environ Monit Assess. 2008;136:87–99. Google Scholar


Durillo J , Nebro A. Jmetal: a Java framework for multi-objective optimization. Adv Eng Software. 2011;42:760–771. Google Scholar


Witten I , Frank E , Hall M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. USA:Morgan Kaufmann, Elsevier; 2011. Google Scholar


Chalfen M , Kamińska J. Analiza wykorzystania roweru miejskiego we wrocławiu (in Polish). Autobusy Technika, Eksploatacja, Systemy Transportowe. 2016;17:543–545. Google Scholar


Toback AT , Hearne JS , Kuritz B , et al. The effect of ambient temperature and humidity on measured idling emissions from diesel school buses. SAE Trans. 2004;113:530–538. Google Scholar


Krause SR , Merrion DF , Green GL. Effect of inlet air humidity and temperature on diesel exhaust emissions. SAE Trans. 1973;82:831–837. Google Scholar
© The Author(s) 2020. This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License, which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page.
Joanna Kamińska, Estrella Lucena-Sánchez, and Guido Sciavicco "Temporal Aspects in Air Quality Modeling—A Case Study in Wrocław," Air, Soil and Water Research 13(1), (31 December 2020).
Received: 10 August 2020; Accepted: 23 October 2020; Published: 31 December 2020
