About

What I did, Why I did it, and What I learned

Background

Welcome to SonicPredict, a small Flask web app I built to serve my fitted machine learning model from the 2020 SPWLA PDDA Machine Learning Contest. The competition ran from March to May 2020 with a total of 31 teams participating. The goal was to predict P- and S-sonic well logs from other commonly acquired well logs. This problem arises often in the oil & gas industry: sonic logs may never be acquired because of budget constraints, may be unreliable due to poor borehole conditions, or may simply be missing from older wells because of lost or poorly managed records.

Motivation

P- and S-sonic curves are acoustic measurements of the travel time, or slowness, of sound through the subsurface. Sonic slowness is the reciprocal of velocity, so faster (~denser and/or stiffer) rocks have smaller slowness values; in other words, it takes less time for the sound wave to travel through the rock. Sonic curves are very important borehole measurements because, when combined with density measurements, they allow geoscientists to approximate each subsurface layer's elastic properties, such as acoustic and shear impedance, bulk modulus, Young's modulus, shear modulus, and Poisson's ratio. These properties are crucial in calibrating the seismic response of the subsurface and allow us to better estimate subsurface rock properties away from a well bore.
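To make the slowness-to-velocity relationship and the elastic-property calculations concrete, here is a small sketch. The function name, the unit assumptions (slowness in us/ft, density in g/cc), and the output naming are mine, not anything prescribed by the contest or the app.

```python
import numpy as np

US_PER_FT_TO_M_PER_S = 304800.0  # velocity (m/s) = 304800 / slowness (us/ft)

def elastic_properties(dtc, dts, rhob):
    """Quick-look elastic properties from sonic slowness and bulk density."""
    vp = US_PER_FT_TO_M_PER_S / np.asarray(dtc)        # P-wave velocity, m/s
    vs = US_PER_FT_TO_M_PER_S / np.asarray(dts)        # S-wave velocity, m/s
    rho = np.asarray(rhob) * 1000.0                    # bulk density, kg/m^3
    ai, si = rho * vp, rho * vs                        # acoustic / shear impedance
    mu = rho * vs**2                                   # shear modulus, Pa
    k = rho * vp**2 - (4.0 / 3.0) * mu                 # bulk modulus, Pa
    pr = (vp**2 - 2 * vs**2) / (2 * (vp**2 - vs**2))   # Poisson's ratio
    e = 2.0 * mu * (1.0 + pr)                          # Young's modulus, Pa
    return {"AI": ai, "SI": si, "K": k, "MU": mu, "E": e, "PR": pr}
```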

Before a geoscientist can seismically calibrate the elastic properties mentioned above, wells first need to be tied. Seismic is natively acquired in the time domain and its natural vertical axis is two-way travel time (two-way because of the time it takes a sound wave to travel down into the subsurface and return after bouncing off a reflector). By measuring travel-time in the wellbore with a sonic curve, the well data, which is naturally measured in depth, can be tied to the seismic two-way time axis. Once this crucial well tie is established, the actual elastic property calibration can begin.
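The well tie ultimately comes down to integrating slowness over depth to build a time-depth relationship. A minimal sketch, assuming DTC in us/ft and a depth track in metres (both assumptions, since log units vary):

```python
import numpy as np

def twt_from_sonic(depth_m, dtc_us_per_ft, twt_at_top_s=0.0):
    """Integrate P-sonic slowness over depth to build a two-way-time curve.
    twt_at_top_s is the two-way time at the top of the log, which in practice
    would come from a checkshot or a replacement velocity (hypothetical here)."""
    dtc_s_per_m = dtc_us_per_ft * 1e-6 / 0.3048        # us/ft -> s/m
    dz = np.diff(depth_m, prepend=depth_m[0])          # sample spacing, m
    one_way_time = np.cumsum(dtc_s_per_m * dz)         # seconds
    return twt_at_top_s + 2.0 * one_way_time           # two-way time, seconds
```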

Once the time-to-depth relationship is established at the wellbore, a velocity model can be built to convert the seismic to the depth domain, or to calibrate one built using technologies such as tomography. As more wells are drilled in an area, this velocity model is improved. There are several reasons for converting seismic from the time- to the depth-domain. First, the depth domain more closely represents the true structure of the subsurface. This is important for understanding things like stratigraphic dip, closures, traps, and amplitude conformance to structure. Second, it makes assessing gross rock volume of a reservoir substantially easier as the units in depth actually relate to a physical volume. Third, it just makes more sense to think about the subsurface in terms of depth rather than travel time.

Sonic curves are also crucial for estimating pore pressure given their sensitivity to pore fluids. Accurate, calibrated models of subsurface pore pressure are critical for safe and efficient drilling. Sonic curves can approximate pore pressure via several well-known transforms, which in turn can be used to calibrate a seismic velocity field into a seismic pore pressure model. Not only is such a model useful for estimating pre-drill pore pressure at new drilling locations, it can also be very useful for understanding subsurface fluid flow from source rock to reservoir.
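One widely used example of such a transform is Eaton's sonic method. My contest work did not involve pore pressure prediction, so this is purely illustrative; the normal-compaction trend and the exponent both have to be calibrated locally, and n = 3 is only a commonly quoted starting point.

```python
def eaton_pore_pressure(obg, p_hydro, dt_normal, dt_observed, n=3.0):
    """Eaton's sonic transform: PP = OBG - (OBG - P_hydro) * (DT_normal / DT_observed)**n.
    All pressures (or pressure gradients) must be in consistent units, and the
    two slowness values must share the same units as each other."""
    return obg - (obg - p_hydro) * (dt_normal / dt_observed) ** n
```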

It should hopefully now be clear why sonic curves are so important. However, as described earlier, they are not always available, and even when they are, the measurements are not always reliable if borehole conditions are poor. If we can build machine learning models that predict P- and S-sonic curves with reasonable accuracy, we can mitigate the problems that arise when these curves are missing or of poor quality.

2020 SPWLA PDDA Machine Learning Competition

Each team was given the same training data set from which to build novel machine learning solutions. Additionally, we were given a 20% sample of the blind test data with which to test our model accuracy. The ultimate goal was to minimize the combined RMSE of both predicted logs (DTC & DTS). The winner of the competition was the team whose final submission resulted in the lowest RMSE when tested on the full blind test data. The data comes from Equinor's Volve field, which was open-sourced in 2018.
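For reference, the metric amounts to something like the snippet below. I'm assuming the organizers pooled both curves into a single RMSE; an average of per-curve RMSEs would differ only slightly.

```python
import numpy as np

def combined_rmse(y_true, y_pred):
    """RMSE over both target columns (DTC, DTS) stacked together."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```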

The training data consisted of a suite of commonly acquired well logs, including gamma ray, resistivity, neutron porosity, density, and photoelectric factor curves, and the targets were the P- and S-sonic logs (DTC & DTS). My final model used GR, HRD, CNC, & ZDEN, and consisted of an average ensemble of a Random Forest Regressor, Gradient Boosting Regressor, XGBoost, Principal Component Regression, & KNN Regression. This model achieved an RMSE of 16.31 and finished 9th of the 20 teams that were ranked. The top team achieved an RMSE of 12.36.

Workflow Summary

For more details, you can refer to my final submission Jupyter notebook as well as my GitHub repository for the competition, both linked above in the navbar. The notebook itself is too large to render directly in the repository, so I have instead linked to github1s, which opens the notebook in VSCode; clicking 'show in preview' on the right-hand side will render it. At a high level, the workflow was as follows (a condensed code sketch follows the list):

1) Data Loading and Initial Inspection
2) Application of Inter-Quartile Range (IQR) Filter
3) Final Feature Selection
4) Pre-processing Pipeline
5) Model Fitting and Hyperparameter Tuning via 5-fold Grid Search Cross-Validation
6) Average Ensemble of models selected as final model
$$\hat{y}_{ensemble} = \frac{\hat{y}_{RFR} + \hat{y}_{GBR} + \hat{y}_{XGB} + \hat{y}_{PCR} + \hat{y}_{KNN}}{5}$$
7) Make Predictions for Blind Test Data
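Below is a condensed, hedged sketch of that workflow. The file names, the hyperparameter grid, and the two ensemble members shown are placeholders; the actual notebook tuned all five models with wider grids.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES, TARGETS = ["GR", "HRD", "CNC", "ZDEN"], ["DTC", "DTS"]

def iqr_filter(df, cols, k=1.5):
    """Drop rows falling outside [Q1 - k*IQR, Q3 + k*IQR] in any column."""
    q1, q3 = df[cols].quantile(0.25), df[cols].quantile(0.75)
    iqr = q3 - q1
    mask = ((df[cols] >= q1 - k * iqr) & (df[cols] <= q3 + k * iqr)).all(axis=1)
    return df[mask]

train = iqr_filter(pd.read_csv("train.csv"), FEATURES + TARGETS)  # hypothetical file name
X, y = train[FEATURES], train[TARGETS]

# Step 5: tune one ensemble member with 5-fold grid search (grid is illustrative).
rf = Pipeline([("scale", StandardScaler()),
               ("model", RandomForestRegressor(random_state=0))])
rf_search = GridSearchCV(rf,
                         {"model__n_estimators": [200, 500],
                          "model__max_depth": [None, 10]},
                         cv=5, scoring="neg_root_mean_squared_error")
rf_search.fit(X, y)

knn = Pipeline([("scale", StandardScaler()),
                ("model", KNeighborsRegressor(n_neighbors=10))]).fit(X, y)

# Steps 6-7: average ensemble (two members shown; the real model averaged five),
# then predict on the blind test features.
blind = pd.read_csv("blind_test.csv")[FEATURES]  # hypothetical file name
y_hat = np.mean([rf_search.predict(blind), knn.predict(blind)], axis=0)
```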

Learnings

  1. Simplicity

    One of my guiding principles has always been to prefer simplicity over complexity and to add complexity only as needed to solve the problem at hand. I didn't adhere very strictly to this personal mantra during this project and jumped eagerly into complex models and hyperparameter tuning. In hindsight, I'm still not convinced that simpler linear models would have performed better, but I wish I had spent a little more time investigating them and looking for possible interaction terms among the predictors. Geoscientists already use many well-known empirical linear models (such as the famous Castagna mud-rock line), and if a linear model could perform comparably while being far more interpretable than a Random Forest, this type of work might gain traction faster in the oil and gas industry. Given the limited time to work on this project, however, I quickly bypassed linear models in favor of tree-based methods, which did appear to work reasonably well.

  2. Feature Engineering

    I did not spend much time on feature engineering. I did apply several transforms to my feature set, such as log transforms, standard scaling, and Principal Component Analysis, but those only re-express the existing features with, hopefully, less variance and a closer approximation to normality. Comparing my results to the top five, a number of those teams employed some form of clustering to identify common lithotypes, and adding that cluster label as a feature appears to have had a positive effect on their models (the idea is sketched in code at the end of this section).

    We were not told the true depth or units of the data, although I wish I had tested some assumptions based on the number of samples and typical log step increments. It is well established that rock properties change with increasing depth as a result of burial and compaction, and I think features carrying information about depth and the compaction trend would have benefited the model.

    One thing that became apparent while creating a lightened version of my model for the API is that including the Random Forest and Gradient Boosting Regressor models added very little uplift over an average ensemble of just XGBoost, PCR, and KNN Regression. I think this underscores that feature engineering would have provided more benefit than spending additional time tuning hyperparameters.

    I did attempt to use the Photoelectric Factor (PE) and Density (ZDEN) curves to create RHOMAA-UMAA logs; cross-plotting RHOMAA against UMAA is a very effective way to separate lithologies. While I had some success, I did not have all the data necessary to do this correctly, and I ultimately abandoned the idea when the transformed data did not plot sensibly on established rock physics templates. Given imperfect and incomplete data, an unsupervised clustering technique such as the one discussed above would likely have been more beneficial.
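    For completeness, here is roughly what those two ideas look like in code. This is a hedged sketch: the fluid constants, the porosity input, the cluster count, and the feature choices are my own assumptions rather than anything specified by the contest.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def rhomaa_umaa(rhob, pe, phi_t, rho_fl=1.0, u_fl=0.398):
    """Apparent matrix density (RHOMAA) and apparent matrix volumetric
    photoelectric absorption (UMAA). rhob in g/cc, pe in barns/electron,
    phi_t = total porosity (v/v). Fresh-water fluid constants shown, and
    U is approximated here as Pe * RHOB."""
    u = pe * rhob
    rhomaa = (rhob - phi_t * rho_fl) / (1.0 - phi_t)
    umaa = (u - phi_t * u_fl) / (1.0 - phi_t)
    return rhomaa, umaa

def add_lithotype_cluster(df, cols=("GR", "CNC", "ZDEN"), n_clusters=5):
    """Unsupervised lithotype label added as an extra feature (illustrative)."""
    scaled = StandardScaler().fit_transform(df[list(cols)])
    out = df.copy()
    out["LITHO_CLUSTER"] = KMeans(n_clusters=n_clusters, n_init=10,
                                  random_state=0).fit_predict(scaled)
    return out
```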

  3. Model Selection

    I chose to work primarily with tree-based methods such as Random Forest, Gradient Boosting, and XGBoost. Of the top five entries, three used tree-based models and two used neural networks. I think this reinforces my earlier point that feature engineering mattered more for accuracy than the choice of model itself. It also underscores how strongly tree-based methods perform; in scenarios where predictive accuracy is more important than model interpretability, they are very reliable choices.

    Several of the top five entries also chose to build separate models for predicting DTC and DTS, sometimes using the predicted DTC as an input for DTS. The models I built performed multi-target regression for both targets at once. Predicting DTC first and then DTS might have improved my model: DTS typically tracks DTC very closely, so it makes sense to first find a robust solution for DTC and then use it as a predictor for DTS.

    Several of the models I used (Gradient Boosting Decision Trees, XGBoost, Ridge Regression, & Support Vector Regression) do not natively support multi-target regression, meaning they cannot directly regress on two response targets at once. To get around this, I used Scikit-Learn's MultiOutputRegressor to wrap the model inside the pipeline. MultiOutputRegressor simply fits one independent regressor per target, so it implicitly assumes the two targets are unrelated. As I just mentioned, we know that DTC and DTS are related, so perhaps this is further evidence that I should have built separate models for each.
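    Here is a minimal sketch of the wrapper approach, along with a chained alternative that captures the DTC-to-DTS dependency discussed in the previous paragraph. The hyperparameters and target order are placeholders, not the values from my submission.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from xgboost import XGBRegressor

# What I did: one independent XGBoost model per target (DTC, DTS).
multi_target = Pipeline([
    ("scale", StandardScaler()),
    ("model", MultiOutputRegressor(XGBRegressor(n_estimators=500))),
])

# One way to exploit the DTC/DTS relationship instead: predict the first
# target, then append that prediction as a feature when predicting the second.
chained = Pipeline([
    ("scale", StandardScaler()),
    ("model", RegressorChain(XGBRegressor(n_estimators=500), order=[0, 1])),
])
```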

Conclusions

All-in-all, I really enjoyed participating in this contest. The timing coincided with the first two months of COVID-19 quarantine lock-down and so it gave me something fun to work on in the evenings as we adjusted to working from home and having very limited external contact. This was the first machine learning contest I've competed in and I'm happy to have finished in the top ten. I think I came away with some great takeaways that will be useful to future projects.

Even though my model was not the top performer, I think I ended up with a nice portfolio project as a result of participating. I had recently built my first Flask web app (PySeisTuned2.0), and this project let me learn how to build a Flask API for serving a machine learning model. Building predictive models is a lot of fun, but if nobody can use them once they are built, you aren't being very useful as a data scientist.
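For a flavor of what serving a fitted model with Flask can look like, here is a minimal sketch in the spirit of SonicPredict. The endpoint name, feature order, and serialized-model filename are assumptions for illustration, not the app's actual implementation.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("sonic_model.joblib")  # hypothetical serialized ensemble

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"GR": [...], "HRD": [...], "CNC": [...], "ZDEN": [...]}
    payload = request.get_json(force=True)
    X = np.column_stack([payload[f] for f in ("GR", "HRD", "CNC", "ZDEN")])
    dtc, dts = model.predict(X).T  # assumes the model returns two columns: DTC, DTS
    return jsonify({"DTC": dtc.tolist(), "DTS": dts.tolist()})

if __name__ == "__main__":
    app.run(debug=True)
```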

I think one of the reasons data science appeals to me so much is that it combines a number of my intellectual interests: coding, mathematics, statistics, physical and social science, and software development. Coding and programming have been an interest, hobby, and passion of mine since middle school. It's a constant learning journey, and it requires patience and persistence; moments of joy usually result from hours of frustration. If anyone out there has questions, I am more than happy to be a resource! I don't have all the answers, not by a long shot, but I like to help when I can. Feel free to use the contact link below to send me a note and I'll get back to you.