Ever wanted to master the art of NIR calibration but got lost in the acronyms? Struggling with getting simple explanations of the basics? Well then, this post is for you and today we’ll work through two scatter correction techniques for NIR spectroscopy in Python. We’ll discuss functions to perform **Multiplicative Scatter Correction (MSC)** and **Standard Normal Variate (SNV)** on near-infrared (NIR) data. Three acronyms in one sentence, that’s a good start.

Data pre-processing is an essential step to build most (ahem, all) types of calibration models in NIR analysis. With a well-designed pre-processing step, the performance of the model is greatly improved.

Before we move on, you want to take a look at some of our past post on the topic, here you go.

Pre-processing techniques can be generally divided into **1) Scatter Corrections** techniques and **2) Derivative** techniques. Both MSC and SNV belong to the first category.

In this post we are going to introduce the problem of scatter correction, describe the details and write some code to perform both MSC and SNV, then apply both method on some real data.

NIR spectra contain a mix of diffuse and specular reflectance (or straight transmittance). Three main factors affecting the shape of each spectrum are:

- Different wavelengths of the incident light experience different
**absorption**by the sample, due to the chemical nature of the sample itself. In most cases this is the signal we want to measure, and it relates to the analyte of interest. - Differences in particle size in the material will cause light to be deviated at different angles depending on it wavelength.
**Scattering effects**(particle size), along with possible differences in path length constitute the major cause of variations in NIR spectra. - Path length differences from sample to sample due to positioning and/or irregularities in the sample surface.

Scattering effects can be both additive and multiplicative. **Additive effects** (such as path length differences) produce a baseline displacement of the spectrum along the vertical axis, while **multiplicative effects** (particle size for instance) modify the local slope of the spectrum.

The idea behind scattering corrections is to get rid of all effects that are unrelated to the chemical nature of the sample, but just depends on the sample morphology and measurement geometry. So, the idea goes, if we are able to get rid of these undesirable effects beforehand, we should get a better model for the quantity of interest.

As always, that is easier said than done. In practice it may be extremely difficult to separate scattering from absorbance effects, and the methods developed by the community tend to be an approximation that is valid under some assumptions.

Having said that however, years of practice showed that both MSC and SNV often do a good job in improving the quality of the calibration model and indeed are two of the most common, yet simple, pre-processing techniques for NIR data.

Let’s dive into it!

MSC requires a reference spectrum to work. This is the most important difference between MSC and SNV. The reference spectrum is ideally a spectrum free of scattering effects.

Now, as you can gather, getting our hands on a spectrum that is free of unwanted scattering effects is not easy, definitely not across all wavelengths we are interested in. For this reason, if the data is reasonably well behaved, we can take the average spectrum to be a close approximation to an ideal spectrum. Particle size and path length effects should vary randomly from sample to sample, and therefore the average should reasonably reduce these effects, at least that’s what we hope. This is the main assumption behind MSC.

Mathematically, if we call \( X_{m} \) the mean spectrum, the correction is done in two steps.

- First we regress each spectrum \( X_{i} \) against the mean spectrum. This is done by ordinary least squares. \( X_{i} \approx a_{i} + b_{i}X_{m} \) .
- We calculate the corrected spectrum \( X^{\mathrm{msc}}_{i} = (X_{i}-a_{i})/b_{i} \).

In general all these spectra have a non-zero mean, and therefore we can optionally mean centre the spectra beforehand.

Here’s a Python function to perform MSC.

MSC

Python

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | def msc(input_data, reference=None): ''' Perform Multiplicative scatter correction''' # mean centre correction for i in range(input_data.shape[0]): input_data[i,:] -= input_data[i,:].mean() # Get the reference spectrum. If not given, estimate it from the mean if reference is None: # Calculate mean ref = np.mean(input_data, axis=0) else: ref = reference # Define a new array and populate it with the corrected data data_msc = np.zeros_like(input_data) for i in range(input_data.shape[0]): # Run regression fit = np.polyfit(ref, input_data[i,:], 1, full=True) # Apply correction data_msc[i,:] = (input_data[i,:] - fit[0][1]) / fit[0][0] return (data_msc, ref) |

Note that, in addition to the corrected spectra, it is good practice to return the mean spectrum too. That can be used on a second data set (for instance validation data) as a reference correction that is consistent with the first set of spectra.

Unlike MSC, SNV correction is done on each individual spectrum, and a reference spectrum is not required. The SNV correction can be divided in two conceptual steps as well.

- Mean centre each spectrum \({X}_{i} \) by taking away its mean \(\bar{X}_{i}\)
- Divide each mean centred spectrum by its own standard deviation: \( X^{\mathrm{snv}}_{i} = (X_{i}-\bar{X}_{i})/\sigma_{i} \).

Translated into Python the previous algorithm looks something like this

Python

1 2 3 4 5 6 7 8 9 10 | def snv(mat): # Define a new array and populate it with the corrected data data_snv = np.zeros_like(input_data) for i in range(mat.shape[0]): # Apply correction data_snv[i,:] = (mat[i,:] - np.mean(input_data[i,:])) / np.std(input_data[i,:]) return data_snv |

To test these functions, I’ve run the following code

NIR scatter correction test

Python

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | import numpy as np import pandas as pd import matplotlib.pyplot as plt X = pd.DataFrame.as_matrix(pd.read_csv('peach_spectra.csv')) wl = np.arange(1100,2300,2) Xmsc = msc(X)[0] # Take the first element of the output tuple Xsnv = snv(X) ## Plot spectra plt.figure(figsize=(8,9)) with plt.style.context(('ggplot')): ax1 = plt.subplot(311) plt.plot(wl, X.T) plt.title('Original data') ax2 = plt.subplot(312) plt.plot(wl, Xmsc.T) plt.ylabel('Absorbance spectra') plt.title('MSC') ax2 = plt.subplot(313) plt.plot(wl, Xsnv.T) plt.xlabel('Wavelength (nm)') plt.title('SNV') plt.show() |

The data set is the same that I’ve used in past articles. It comes from a NIR reflectance experiment on peaches taken with the Brimrose Luminar 5030. The output is

Indeed the two methods give an almost identical result, at least in this case. As it has been first shown by Danoha* et al.* in this paper (if you can’t get a copy, feel free to contact me), the two methods are indeed related by a linear transformation.

In closing, one of the advantage of MSC is that can relate all spectra to a common reference. If the reference spectrum is, to a good approximation, close to a spectrum free of unwanted scattering, this is a pretty good choice. Conversely, if outliers are present then the mean spectrum may not be a good reference and the MSC is not a good choice. In these cases SNV can be opted for instead.

Thanks for reading and until next time!

Physicist and an entrepreneur. Founder and Managing Director at Instruments & Data Tools, specialising in optical design and analytical instrumentation. Founder at Rubens Technologies, the intelligence system for the fresh fruit export industry.