The inaugural conference on Data Science, Statistics and Visualisation (DSSV2017) was an interesting change from other conferences. The audience was mixed, with a good proportion (roughly 50%) coming from industry. Many sessions reminded me of the Meetups that occur in Sydney, and the balance between mathematical talks and those pitched at a higher level was appealing.
In this conference I talked about the medal plot, a graphic for visualising uncertainty in space-time fields that was developed by Jonty Rougier and myself during my stay in Bristol. The medal plot can be used to see the effect of considering spatial correlations in your model, and the effect of your prior judgement on your posterior uncertainties. More details are given in our paper here.
The conference opened with an interesting talk by Trevor Hastie, who gave a run-through of subset selection methods before introducing a new one: the relaxed LASSO. Some interesting points, a few of which we are probably aware of already:
- In Trevor’s experience, forward subset selection is in practice about as good as best subset selection, but of course orders of magnitude faster. Forward selection, unlike backward selection, also works when n < p.
- An exhaustive search for best subset selection can easily be carried out for up to 35 covariates using the leaps package in R (see the first sketch after this list).
- Recently, Bertsimas, King, and Mazumder cracked the best-subset selection curse of dimensionality using mixed-integer programming and the Gurobi solver (which can be used from R). However, is best subset selection always the best? No: it tends to overfit the data.
- LASSO can be seen as a “convex relaxation” of the best-subset selection problem (which has an L0 penalty term instead of an L1 penalty, and hence a non-convex cost function).
- In glmnet, the LASSO is implemented using pathwise coordinate descent, which computes the solution over a fine grid of values of the regularisation parameter lambda.
- Trevor spent some time talking about the connections between Least-Angle Regression, LASSO, and forward subset selection, which was interesting.
- In the relaxed LASSO (which was new to me), one first uses the LASSO to identify the non-zero terms, and then takes a linear combination of the LASSO estimates and the OLS estimates of those non-zero terms (see the second sketch after this list). Note that this is different from the elastic net, in which the cost function contains both an L1 and an L2 penalty, but it is similar in spirit. Trevor claimed that this reduces the shrinkage bias of vanilla LASSO methods.
- Trevor showed the results of an extensive simulation study containing experiments with different signal-to-noise ratios (SNRs) and different levels of correlation between the covariates. Interestingly, the best subset and forward subset selection methods only outperformed the LASSO methods at high SNRs. The relaxed LASSO consistently outperformed the LASSO.
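For concreteness, here is a minimal sketch of an exhaustive best-subset search with leaps; the toy simulated data and the use of BIC to pick the subset size are my own choices, not something from the talk.

```r
# Minimal sketch: exhaustive best-subset search with the leaps package.
library(leaps)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(n)  # only the first 3 covariates matter

fit <- regsubsets(x = X, y = y, nvmax = p, method = "exhaustive")
summary(fit)$bic                             # BIC of the best model of each size
coef(fit, id = which.min(summary(fit)$bic))  # coefficients of the BIC-best subset
```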
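And here is a rough sketch of the relaxed-LASSO idea as I understood it (my own toy code, not Trevor's): fit the LASSO, refit OLS on the selected variables, and blend the two coefficient vectors with a mixing weight gamma, which in practice one would choose by cross-validation (here it is simply fixed).

```r
# Rough sketch of the relaxed LASSO as I understood it from the talk (toy code).
library(glmnet)

set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1, rep(0, p - 3))
y <- drop(X %*% beta_true) + rnorm(n)

cvfit   <- cv.glmnet(X, y)                             # LASSO path with CV for lambda
b_lasso <- drop(as.matrix(coef(cvfit, s = "lambda.min")))[-1]
active  <- which(b_lasso != 0)                         # variables selected by the LASSO

b_ols         <- numeric(p)
b_ols[active] <- coef(lm(y ~ X[, active]))[-1]         # OLS refit on the active set

gamma     <- 0.5                                       # mixing weight in [0, 1]
b_relaxed <- gamma * b_lasso + (1 - gamma) * b_ols     # less shrinkage than the LASSO
```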
Another talk I found very interesting was the one by Andre Martins on constructing sparse probability distributions by replacing the ubiquitous softmax function with a sparsemax function. More details are given in his paper here. The problem is formulated as an optimisation problem in which one projects the weights onto the probability simplex. While I suspect this part is not new, Andre also constructed a loss function, inspired by that of the softmax, that leads to simple optimisation algorithms.
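For the curious, below is a quick sketch of the sparsemax transformation as I understand it from the paper: the Euclidean projection of a vector of scores onto the probability simplex, which can be computed by sorting and thresholding. The example scores are made up.

```r
# Sketch of the sparsemax transformation (projection onto the probability simplex).
sparsemax <- function(z) {
  z_sort <- sort(z, decreasing = TRUE)
  k      <- seq_along(z)
  cssv   <- cumsum(z_sort)
  rho    <- max(which(1 + k * z_sort > cssv))  # size of the support
  tau    <- (cssv[rho] - 1) / rho              # threshold
  pmax(z - tau, 0)                             # entries below tau become exactly zero
}

softmax <- function(z) exp(z) / sum(exp(z))

z <- c(1.2, 0.9, 0.3, -0.5)
sparsemax(z)  # sparse: 0.65 0.35 0.00 0.00
softmax(z)    # dense: all entries strictly positive
```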
David Banks talked about emulation of agent-based models (ABMs), something which I am personally sceptical about, since ABMs are useful precisely because they are chaotic in nature, a feature emulators would struggle to capture. Yet, when I enquired about this, David said he and his student had had considerable success with emulation, and that the discrepancy term is able to capture the “unpredictable component” in ABMs fairly well, so I will keep an open mind on this one!
On the last day we had several interesting talks. Mario Figueiredo talked about the OSCAR loss function, which deals with correlated variables differently from the LASSO: while the LASSO will tend to assign all the weight to one of a group of correlated variables, OSCAR will assign them equal weight. There is an interesting connection between the “shrinking to exactly zero” of the LASSO and the “shrinking to exactly the same weight for correlated variables” of the OSCAR criterion. Mario extended this to an OWL criterion which generalises the OSCAR criterion (but I’m not sure in what way!).
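To make the “equal weight” behaviour a little more concrete, here is a toy evaluation of the OSCAR penalty as I recall it from Bondell and Reich's paper (my own code, nothing from Mario's talk): an L1 term plus a sum of pairwise maxima of the absolute coefficients, the latter being what encourages correlated variables to share the same absolute weight.

```r
# Toy evaluation of the OSCAR penalty as I recall it (my own code, not Mario's).
oscar_penalty <- function(beta, lambda1, lambda2) {
  pairs <- combn(length(beta), 2)  # all pairs (j, k) with j < k
  lambda1 * sum(abs(beta)) +
    lambda2 * sum(pmax(abs(beta[pairs[1, ]]), abs(beta[pairs[2, ]])))
}

# Two coefficient vectors with the same L1 norm: the pairwise-max term is smaller
# when the non-zero coefficients share the same magnitude.
oscar_penalty(c(2, 2, 0, 0), lambda1 = 1, lambda2 = 0.5)  # equal weights: 9
oscar_penalty(c(4, 0, 0, 0), lambda1 = 1, lambda2 = 0.5)  # one variable only: 10
```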
Peter Rousseeuw talked about detecting cell-wise outliers (as opposed to row-wise outliers) in data. He showed some impressive figures obtained using his R package cellWise. The Technometrics paper describing the technique is here. I do have an issue with treating outliers and anomalies as the same thing, something which every speaker at this conference did. In my opinion, an outlier is a property of the data, while an anomaly is a property of the process. That is, an outlier is something we would like to filter out, while an anomaly is something we would like to keep, model, and even predict!
After Peter, three speakers from Google gave a talk on market attribution models, which involved using non-stationary Markov models to simulate user behaviour. Finally, Daniel Keim gave a very impressive talk on visual analytics. He took the audience through a graphical journey involving Twitter feeds for disaster detection, IP monitoring for attack detection, and the analysis of soccer team strategies from millisecond-resolution data on players’ positions on the field. All this came after he lambasted statisticians somewhat for solving “small” problems using only “selective” data (which, he granted, we do “very well”).
I of course do not agree with Daniel; we do what we do not because we are scared of dealing with “very” big data, but because it is not always necessary to work with all the data one could possibly collect in order to solve big problems. I think Daniel focuses on some important fields which indeed require a lot of data (and which are hence technologically very challenging), but I disagree that this is always “the real world”. Further, just as in visual analytics, we also use process chains (from data collection, to data pre-processing, to feature selection, to modelling, to inference, to visualisation), much like those needed in the truly big-data cases he described.
One point which concerned me was that there was no mention of uncertainty. This is not a problem when raw data is displayed, but when an inference is conveyed, decision-making based on a point prediction without a notion of risk can be dangerous.
One of the graphics Daniel showed, and one which I never tire of seeing, was Anscombe’s quartet:
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
All four data sets shown above have the same marginal means, the same correlation coefficient, and the same best linear fit. Daniel used this to emphasise the importance of visualisation. An extreme case of this is shown in the lovely video below by Justin Matejka, which is an excellent resource for STAT101!
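Incidentally, the quartet ships with R as the anscombe data frame, so the identical summary statistics are easy to verify; the quick check below is my own, not one of Daniel's slides.

```r
# The quartet is built into R: each of the four (x, y) pairs has the same means,
# the same correlation, and the same fitted regression line.
data(anscombe)
sapply(1:4, function(i) {
  x   <- anscombe[[paste0("x", i)]]
  y   <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
```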