User:Ifarley/Facies classification using Neural Network algorithm
Machine Learning (ML) is a field of Artificial Intelligence (AI) that has experienced rapid growth in the last ten years across diverse industries, including communications, financial services, security, and transportation. Applications of machine learning have produced dramatic results, enabling new opportunities and business models. Machine learning facilitates an understanding of complex relationships among a large and diverse set of variables, which is valuable for generating and validating models and answering scientific questions. Machine learning can enable fast, high-quality decisions in the oil and gas industry, an essential component of viability given the industry's long-term outlook. Geoscience datasets are among the largest volumes of data in the industry; they span a wide spectrum of properties with scales varying over many orders of magnitude. This tutorial touches on the challenges, opportunities, and trends related to the adoption of machine learning in geoscience research and industrial workflows.
Many free and open-source packages now exist that provide
powerful additions to the geoscientist's toolbox, much of which used to be
available only in proprietary (and expensive) software platforms. One of the best examples is
scikit-learn (http://scikit-learn.org/), a collection of
tools for machine learning in Python. In this tutorial, we will
demonstrate how to use a classification algorithm known as neural network to
identify lithofacies based on well-log measurements. A neural network algorithm
is a type of supervised-learning algorithm, which needs to be supplied with
training data to learn the relationships between the measurements (or features)
and the classes to be assigned. In our case, the features will be well-log data
from nine gas wells. These wells have already had lithofacies classes assigned
based on core descriptions. Once we have trained a classifier, we will use it
to assign facies to wells that have not been described. At the end, we will
also compare the neural network's results with those of a support vector
machine trained on the same data set.
What is a neural network?
Wikipedia provides a basic introduction to the neural network algorithm. The main purpose of this tutorial is the application of neural networks to facies classification, so we won't discuss the algorithm itself in much depth.
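Very briefly, a multi-layer perceptron passes each feature vector through one or more hidden layers (a linear map followed by a nonlinearity) and scores each class at the output. Here is a minimal sketch using random, untrained weights; only the sizes mirror our problem (seven features, nine facies), everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Hidden-layer nonlinearity used by MLPClassifier by default
    return np.maximum(0.0, x)

def softmax(z):
    # Turn output scores into class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

n_features, n_hidden, n_classes = 7, 20, 9   # 7 log features, 9 facies

# Illustrative random weights; a trained network learns these from data
W1 = rng.normal(size=(n_hidden, n_features))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_classes, n_hidden))
b2 = np.zeros(n_classes)

x = rng.normal(size=n_features)        # one standardized feature vector
hidden = relu(W1 @ x + b1)             # hidden-layer activations
probs = softmax(W2 @ hidden + b2)      # probability for each facies class
predicted_facies = probs.argmax() + 1  # facies are numbered 1-9
```

Training consists of adjusting the weights so that the predicted probabilities match the known facies labels in the training data.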
Exploring the data set
The data set we will use comes from a University of Kansas class exercise on the Hugoton and Panoma gas fields. For more on the origin of the data, see Dubois et al. (2007) and the Jupyter notebook that accompanies this tutorial at http://github.com/seg.
The code that follows is adapted from Brendon Hall's SEG wiki tutorial, http://wiki.seg.org/wiki/Facies_classification_using_machine_learning, to which full credit is due. There are two major changes. First, the
sns.pairplot function has changed, so the correct call is now
>>> sns.pairplot(feature_vectors, hue='Facies')
Second, since we use a neural network, all the code involving the support vector machine has been changed to use a neural network.
The data set consists of seven features (five wireline log measurements and two indicator variables) and a facies label at half-foot depth intervals. In machine learning terminology, the set of measurements at each depth interval comprises a feature vector, each of which is associated with a class (the facies type). We will use the
pandas library to load the data into a dataframe, which provides a convenient data structure to work with well-log data.
>>> import pandas as pd
>>> data = pd.read_csv('training_data.csv')
We can use
data.describe() to provide a quick overview of the statistical distribution of the training data (Table 1).
We can see from the count row in Table 1 that we have a total of 3232 feature vectors in the data set. The feature vectors consist of the following variables:
- Gamma ray (GR)
- Resistivity (ILD_log10)
- Photoelectric effect (PE)
- Neutron-density porosity difference (DeltaPHI)
- Average neutron-density porosity (PHIND)
- Nonmarine/marine indicator (NM_M)
- Relative position (RELPOS)
There are nine facies classes (numbered 1–9) identified in the data set. Table 2 contains the descriptions associated with these classes. Note that not all of these facies are completely discrete; some gradually blend in to one another. Misclassification of these neighboring facies can be expected to occur. The
Adjacent facies column in Table 2 lists these related classes.
{| class="wikitable"
!Facies!!Description!!Label!!Adjacent facies
|-
|2||Nonmarine coarse siltstone||CSiS||1, 3
|-
|3||Nonmarine fine siltstone||FSiS||2
|-
|4||Marine siltstone and shale||SiSh||5
|}
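Before training, it is also worth checking how balanced the facies classes are, since rare classes are harder to learn. A minimal sketch of counting labels with pandas; the synthetic labels below stand in for data['Facies']:

```python
import pandas as pd

# With the real dataframe you would count data['Facies'];
# these label values are illustrative, not from the data set.
facies = pd.Series([2, 2, 3, 3, 3, 4, 8, 8, 8, 8])

counts = facies.value_counts().sort_index()  # samples per facies class
fractions = counts / len(facies)             # relative abundance of each class
```

A strongly imbalanced class distribution is one reason to report per-class metrics (as classification_report does below) rather than overall accuracy alone.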
To evaluate the accuracy of the classifier, we will remove one well from the training set so that we can compare the predicted and actual facies labels.
>>> test_well = data[data['Well Name'] == 'NEWBY']
>>> data = data[data['Well Name'] != 'NEWBY']
Let's extract the feature vectors and the associated facies labels from the training data set:
>>> features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
>>> feature_vectors = data[features]
>>> facies_labels = data['Facies']
Crossplots are a familiar tool to visualize how two properties vary with rock type. This data set contains five log measurements, and we can employ the very useful
seaborn library https://seaborn.pydata.org/ (Waskom et al., 2016) to create a matrix of crossplots to visualize the variation between the log measurements in the data set.
>>> import seaborn as sns
Each pane in Figure 1 shows the relationship between two of the variables on the x and y axes, with a stacked bar plot showing the distribution of each point along the diagonal. Each point is colored according to its facies (see the Jupyter notebook associated with this tutorial for more details on how to generate colors for this plot). It is not clear from these crossplots what relationships exist between the measurements and the facies labels. This is where machine learning will prove useful.
Conditioning the data set
Many machine-learning algorithms assume the feature data are normally distributed (i.e., Gaussian with zero mean and unit variance). Table 1 shows us that this is not the case with our training data. We will condition, or standardize, the training data so that it has this property. The same factors used to standardize the training set must be applied to any subsequent data set that will be classified.
scikit-learn includes a handy
StandardScaler class that can be applied to the training set and later used to standardize any input data.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(feature_vectors)
>>> scaled_features = scaler.transform(feature_vectors)
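Under the hood, standardization is just z = (x - mean) / std applied column by column. A minimal sketch, with synthetic numbers standing in for the well logs:

```python
import numpy as np

# Two synthetic "log" columns with very different scales
X = np.array([[10.0, 100.0],
              [20.0, 200.0],
              [30.0, 300.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std  # equivalent to StandardScaler().fit_transform(X)
```

After scaling, every column has zero mean and unit variance, so no single log dominates the classifier simply because of its units. Keeping the fitted mean and std (as the scaler object does) is what lets us apply the identical transform to the test well later.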
A standard practice when training supervised-learning algorithms is to separate some data from the training set to evaluate the accuracy of the classifier. We have already removed data from a single well for this purpose. It is also useful to have a cross-validation data set we can use to tune the parameters of the model.
scikit-learn includes a handy function to randomly split the training data into subsets. Let's use 5% of the data for the cross-validation set.
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_cv, y_train, y_cv = train_test_split(scaled_features, facies_labels, test_size=0.05, random_state=42)
(In scikit-learn versions before 0.18, train_test_split lived in the now-removed sklearn.cross_validation module.)
Training the classifier
Now we can use the conditioned data set to train a neural network to classify facies.
The neural network implementation in
scikit-learn (http://scikit-learn.org/stable/modules/neural_networks_supervised.html) takes a number of important parameters. These control, among other things, the solver, the regularization strength (alpha), and the network architecture (hidden_layer_sizes). The choice of parameters can affect the accuracy of the classifier, and finding optimal parameter choices is an important step known as model selection. A succession of models is created with different parameter values, and the combination with the lowest cross-validation error is used for the classifier. See the notebook accompanying this article for more details on the model selection procedure used to obtain the parameter choices used here.
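That model-selection loop can be sketched with scikit-learn's GridSearchCV, which tries every parameter combination and keeps the one with the best cross-validation score. The synthetic data and the grid values below are illustrative assumptions, not the tutorial's actual data or tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled log features (7 features, 3 classes)
X, y = make_classification(n_samples=200, n_features=7, n_informative=5,
                           n_classes=3, random_state=42)

# Illustrative grid: regularization strength and network architecture
param_grid = {
    'alpha': [1e-5, 1e-3],
    'hidden_layer_sizes': [(20, 10), (50,)],
}

search = GridSearchCV(
    MLPClassifier(solver='lbfgs', max_iter=500, random_state=1),
    param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_  # combination with the best CV score
```

On the real data, the grid would be built around the parameter ranges discussed in the accompanying notebook.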
>>> from sklearn.neural_network import MLPClassifier
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(20, 10), random_state=1)
Now we can train the classifier using the training set we created above. Note that we use two hidden layers in this case, one with 20 neurons and one with 10. You can always change the number of layers and the number of neurons in each layer.
>>> clf.fit(X_train, y_train)
That's it! Now that the model has been trained on our data, we can use it to predict the facies of any well with the same set of features as our training set. We set aside a well for exactly this purpose.
Evaluating the classifier
To evaluate the accuracy of our classifier we will use the well we kept for a blind test and compare the predicted facies with the actual ones. We need to extract the facies labels and features of this data set and rescale the features using the same parameters used to rescale the training set.
>>> y_test = test_well['Facies']
>>> well_features = test_well.drop(['Facies', 'Formation', 'Well Name', 'Depth'], axis=1)
>>> X_test = scaler.transform(well_features)
Now we can use our trained classifier to predict facies labels for this well, and store the results in the
Prediction column of the test_well dataframe.
>>> y_pred = clf.predict(X_test)
>>> test_well['Prediction'] = y_pred
Because we know the true facies labels of the vectors in the test data set, we can use the results to evaluate the accuracy of the classifier on this well.
>>> from sklearn.metrics import classification_report
>>> target_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
>>> print(classification_report(y_test, y_pred, target_names=target_names))
Our classifier achieved an overall F1 score of 0.52 on the test well, so there is room for improvement. It is interesting to note that if we count misclassification within adjacent facies as correct, the classifier has an overall F1 score of 0.97.
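The adjacent-facies score can be computed with a small helper that counts a prediction as correct when it matches the true class or one of its listed neighbors. The adjacency map below includes only the rows of Table 2 shown above (facies 2, 3, and 4), and the example labels are illustrative; the full nine-class map comes from Dubois et al. (2007):

```python
def adjacent_accuracy(y_true, y_pred, adjacent):
    """Fraction of predictions matching the true facies or an adjacent one."""
    hits = sum(
        p == t or p in adjacent.get(t, set())
        for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)

# Partial adjacency map from the rows of Table 2 shown above
adjacent = {2: {1, 3}, 3: {2}, 4: {5}}

# Illustrative labels: every prediction is exact or adjacent here
y_true = [2, 3, 4, 2]
y_pred = [3, 2, 5, 2]
score = adjacent_accuracy(y_true, y_pred, adjacent)  # -> 1.0
```

With the full adjacency table, applying this function to y_test and y_pred gives the relaxed score quoted above.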
Let's look at the classifier results in log-plot form. Figure 4 is based on the plots described in Alessandro Amato del Monte's excellent tutorial from June 2015 of this series. The five logs used as features are plotted in Figure 3.
In addition to the neural network results, I also repeated the work Brendon Hall has done, applying another machine-learning algorithm, a support vector machine (SVM), to the same data set. Based on the statistical results from both classifiers, the true facies, the SVM-predicted facies, and the neural-network-predicted facies are plotted in Figure 4. We can see that the neural network did a better job of facies classification.
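The head-to-head comparison can be sketched as follows: train an MLP and an SVM on the same features and compare held-out accuracy. This uses synthetic data and illustrative parameters, so its scores say nothing about the real well logs:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the scaled log features and facies labels
X, y = make_classification(n_samples=300, n_features=7, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

# Same architecture as the tutorial's MLP; the SVC parameters are illustrative
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(20, 10),
                    max_iter=500, random_state=1).fit(X_tr, y_tr)
svm = SVC(C=10, gamma='scale').fit(X_tr, y_tr)

mlp_score = mlp.score(X_te, y_te)  # held-out accuracy, MLP
svm_score = svm.score(X_te, y_te)  # held-out accuracy, SVM
```

On the real well data, the same comparison would be made with the F1 scores from classification_report, per facies class, rather than raw accuracy alone.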
This tutorial has provided a brief overview of a typical machine learning workflow: preparing a data set, training a classifier, and evaluating the model. Libraries such as
scikit-learn provide powerful algorithms that can be applied to problems in the geosciences with just a few lines of code.
- Dubois, M. K., G. C. Bohling, and S. Chakrabarti, 2007, Comparison of four approaches to a rock facies classification problem: Computers & Geosciences, 33, no. 5, 599–617, http://dx.doi.org/10.1016/j.cageo.2006.08.011
- Waskom, M., et al., 2016, Seaborn, v0.7, January 2016: Zenodo, http://dx.doi.org/10.5281/zenodo.45133
- Amato del Monte, A., 2015, Seismic petrophysics: Part 1: The Leading Edge, 34, no. 4, 440–442, http://dx.doi.org/10.1190/tle34040440.1
You can find all the data and code for this tutorial at https://github.com/seg.
If you want to learn more about machine learning, here is a good online course for beginners: https://www.coursera.org/learn/machine-learning