I recently wrote a spectroscopy paper with Dr. Thomas Beechem at Purdue University. It is currently being reviewed (total tangent, but did you know that most scientific publications concerningly do not ask for the codebase when reviewing a project with code?), but the preprint can be found here. I'm quite proud of it, but it is not immediately accessible to the average engineer who isn't familiar with spectroscopy. If I had read the paper in 2022, when I first joined Specere Labs, where I worked with Dr. Beechem, I would not have been able to understand what I recently sent out for review. That's a shame, because I think it is an interesting concept.
This is why I decided that, to go along with the paper, I would write a blog post! This post is designed to be understandable to someone with a basic grasp of math and engineering but not necessarily any experience in the field of spectroscopy. In other words, it is meant to be understandable to the me of 2022, a newly minted graduate in computer science and engineering from UC Davis.
Compressed Sensing
Images are highly compressible.
We have known this for a long time. You can take the Fourier transform of an image, run it through a low-pass filter, and convert it back to an image with very little visible loss. If you don't even need to avoid visible quality loss and only need the observer to be able to tell what the image is of, you can compress it even further.
As you can see in the above image (stolen from this textbook, which I highly recommend), the image of the dog only becomes visibly degraded once 99.8% of the data is removed. And even then, it is still clearly a dog!
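To make this concrete, here is a minimal sketch of that kind of Fourier-domain compression in Python. It uses a synthetic grayscale image in place of the dog photo and keeps only the largest 0.2% of Fourier coefficients (rather than a strict low-pass filter), so the numbers are purely illustrative:

```python
import numpy as np

# Synthetic grayscale "image": a smooth blob plus a gentle gradient, so most
# of its energy lives in a handful of low-frequency Fourier coefficients.
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
image = np.exp(-((x - 0.3) ** 2 + (y - 0.6) ** 2) / 0.02) + 0.5 * x + 0.2 * y

# Transform, throw away 99.8% of the coefficients, and transform back.
F = np.fft.fft2(image)
keep_fraction = 0.002
threshold = np.quantile(np.abs(F), 1 - keep_fraction)
F_sparse = np.where(np.abs(F) >= threshold, F, 0)
reconstruction = np.real(np.fft.ifft2(F_sparse))

relative_error = np.linalg.norm(reconstruction - image) / np.linalg.norm(image)
print(f"Kept {keep_fraction:.1%} of coefficients, relative error: {relative_error:.4f}")
```

Because the synthetic image is smooth, almost all of its energy sits in a few coefficients, which is exactly the property real photographs exploit.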
The reason that images are highly compressible is that, given an n by n pixel image, there are very few combinations of pixels that actually convey any meaning. Most display screens are capable of making a pixel one of 256^3 different colors, because each of the three color channels (red, green, and blue) is represented by 8 bits (2^8 = 256). This means that an n by n pixel image can display (256^3)^(n^2) possible combinations, which is astronomically large even for a small n. But the vast majority of these images do not actually provide any useful information to the human brain. This is why most QR codes basically look the same at first glance even though they're actually quite different.
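If you want a feel for just how large that number gets, a few lines of Python will do it (the 8-by-8 image size here is an arbitrary small example):

```python
n = 8                                # an 8 x 8 pixel image
colors_per_pixel = 256 ** 3          # 8 bits per channel, 3 channels
total_images = colors_per_pixel ** (n * n)
print(len(str(total_images)))        # number of digits: about 460 for n = 8
```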
In other words, for any given image with something meaningful on it, there are a whole lot of images that a human would view as very similar to it. When a large basis space (a basis space is, loosely, the set of all possible combinations) contains only a few meaningful combinations, the signal is said to have a sparse representation. Sparse representations are very compressible.
Spectroscopy
Spectroscopy is an entire field of science devoted to deducing information by examining the different colors of light hitting a sensor, which is typically a camera. While I won't try to describe the methods or hardware of spectroscopy in only a few paragraphs, I can try to explain the parts of spectroscopy relevant to this publication.
Spectroscopy can detect what chemicals are in a substance, since different wavelengths (i.e., colors) are absorbed depending on the substance. You do this by shining light through the substance in question and detecting which wavelengths are absorbed and which are transmitted/reflected. The energy of the light shone on the substance is plotted against how much absorption took place, producing a spectrum unique to that chemical. It's kind of like a fingerprint. We used IR (infrared) spectroscopy data, which means that the light shone through the chemical was all at infrared wavelengths. Humans cannot see infrared, but infrared wavelengths are good at distinguishing different chemicals from each other. We got our spectral data from the National Institute of Standards and Technology (NIST), which is a government institution famous for selling human fecal matter. It also sells other things, such as very detailed and accurate IR spectra. The image below is NIST's IR spectrum of acetone.
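To illustrate the fingerprint idea, here is a toy sketch with entirely made-up absorption bands (not real NIST data): two "chemicals" built from Gaussian peaks at different wavenumbers, which is what makes their spectra distinguishable.

```python
import numpy as np

wavenumbers = np.linspace(500, 4000, 2000)  # typical mid-IR range, in cm^-1

def toy_spectrum(peak_centers, widths, heights):
    """Build a fake absorption spectrum as a sum of Gaussian bands."""
    spectrum = np.zeros_like(wavenumbers)
    for c, w, h in zip(peak_centers, widths, heights):
        spectrum += h * np.exp(-((wavenumbers - c) ** 2) / (2 * w ** 2))
    return spectrum

# Two made-up "chemicals" with bands in different places.
chemical_a = toy_spectrum([1200, 1715, 2950], [30, 25, 40], [0.4, 1.0, 0.6])
chemical_b = toy_spectrum([1050, 1600, 3300], [35, 30, 60], [0.7, 0.5, 0.9])

# Even a crude distance between the two fingerprints tells them apart.
print("Distance between fingerprints:",
      round(float(np.linalg.norm(chemical_a - chemical_b)), 2))
```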
The Specific Domain
Now I'm sure you can see where this is going. We compress the spectra and determine what substance we are studying with less data! However, spectra are not as information-sparse as pixel images, so taking the Fourier transform and running the result through a low-pass filter detracts from the accuracy. It has also been done before, and while it has a use case (cheap spectroscopy systems that don't need to be as accurate), that was not what we were trying to do.
However, there is a case where spectra are sparse, and that is when the domain is known. It is rare to have absolutely no idea what substance you are dealing with, so you don't have to consider the possibility that the substance could be anything. If you are testing water that may contain lead, you know what the spectrum of water looks like and what the spectrum of lead looks like. Even if the water might have other contaminants in it, such as lime or other common minerals, there is a set of spectra that you can expect. You aren't going to find large quantities of cyanide in drinking water (hopefully), so it doesn't need to be taken into account; and if it does, you have a bigger problem.
By limiting the domain, we can optimize for both efficiency and accuracy, since what we trade away is the ability to generalize. In practice, this could be useful for someone who has a specific domain and needs an accurate, cheap contaminant detector.
Creating a Basis
In our case, we were interested in mixtures of volatile organic compounds (VOCs) that are often used in industrial processes. Specifically, the goal was to see if we could differentiate between the VOCs and also identify contamination of one VOC by another. While this algorithm would hopefully work for any type of chemical, we had limited time to run our experiments, and our methods section couldn't be infinitely long. So while I ran some tests on my laptop where I applied this method to a number of chemicals and it did quite well, the huge, robust test that we ran on Purdue's HPC (which is named Negishi) focused on the chemical acetone. With respect to the latter, we used acetone as a case study owing to both its common use and the fact that it exhibits an IR spectrum that is 'unremarkable' relative to the database as a whole. That's a fancy way of saying we picked a normal-looking spectrum to really test how well the approach performs.
The key to this algorithm working is that we know the full set of possible spectra. We can define the basis. That is the point of restricting the domain. Since we know the basis, we can build a matrix that represents it. The basis matrix consists of one row per spectrum. So let's say you have acetone, which could be contaminated with either isopropanol or ethanol. You have a row representing the spectrum of a mole of acetone, a row with a mole of acetone and 1 µmole of isopropanol, a row with a mole of acetone and 2 µmoles of isopropanol, and so on. Then you do the same with ethanol. Technically this means that there is no row for, say, acetone with 1.5 µmoles of isopropanol, but we found that this doesn't really matter: acetone contaminated with 1.5 µmoles of isopropanol looks quite similar to acetone contaminated with 1 or 2 µmoles of isopropanol. And in any event, we are not trying to detect the exact number of moles of the contaminant; we are just trying to figure out whether isopropanol is messing up the acetone. So as long as the measurement looks more like acetone contaminated with isopropanol than acetone contaminated with ethanol, it is all good.
Now, this does result in a rather large basis matrix. We used NIST's "NIST Quantitative Pollutants" database, which contains 40 chemicals, and we tested 16 different levels of contamination (if you are curious: 100, 80, 60, 40, 20, 10, 8, 6, 4, 2, 1, 0.8, 0.6, 0.4, 0.2, and 0.1 µmole per mole). This gives a basis matrix with 40 × 16 = 640 rows. Since our whole goal is sparsity, that seems like far too many rows!
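Here is a rough sketch of how a basis matrix like that could be assembled. The spectra are random stand-ins rather than the real NIST data, and the simple "acetone plus a scaled contaminant" sum is my own illustrative assumption, not the exact mixing model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_wavelengths = 3516                      # points per spectrum in the test set
n_contaminants = 40                       # chemicals in the NIST database
levels = [100, 80, 60, 40, 20, 10, 8, 6, 4, 2,
          1, 0.8, 0.6, 0.4, 0.2, 0.1]     # µmole contaminant per mole of acetone

# Random non-negative stand-ins for the real NIST spectra.
acetone = rng.random(n_wavelengths)
contaminants = rng.random((n_contaminants, n_wavelengths))

# One row per (contaminant, level) pair: acetone plus a small dose of contaminant.
rows = []
for spectrum in contaminants:
    for level in levels:
        rows.append(acetone + (level * 1e-6) * spectrum)
V = np.vstack(rows)

print(V.shape)  # (640, 3516): 40 contaminants x 16 contamination levels
```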
But the thing is, most spectroscopy systems have essentially infinite basis spaces. They don't use a basis matrix—there are several other methods for determining the spectra which are all FAR more computationally expensive than matrix multiplication, which is what we end up doing. This means we can make the basis matrix pretty big and not have to worry about being too computationally expensive. Plus, computers are really good at matrix multiplication, especially now with machines being optimized for graphics and AI.
Non-Negative Matrix Factorization
In the above image (I can't take credit for it; Dr. Beechem made it), the V matrix is the basis matrix. It consists of all the spectra we created stacked on top of each other. This whole big matrix can be represented in a compressed way through two matrices (W, H) deduced using a numerical technique called Non-Negative Matrix Factorization (NMF). Using NMF, we represent our 'big data' more compactly with two matrices constrained to be non-negative, since real hardware only produces positive numbers: sensors read the presence, rather than the lack, of light. The H matrix, which represents the 'eigenspectra', provides a smaller set of spectra that can be mixed together to 'make' all the data in the big matrix. Mathematically, the amount of mixing is dictated by the weights, W. The big idea here is that if we make filters whose 'pass-band' (i.e., the light they let through) matches each row of H, the intensity of light hitting the detector is effectively W. This means I can reconstruct a spectrum with far fewer measurements, since I know how H and W map to the real data (which in this case is the big matrix). The question becomes: how few rows of H do I need to tell whether my acetone is contaminated?
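Here is a minimal sketch of this factorization step, using scikit-learn's NMF on a random non-negative stand-in for the basis matrix V (the real V would be the 640-row matrix of mixture spectra described above):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((640, 3516))          # stand-in for the stacked basis spectra

# Factor V ~= W @ H with everything constrained to be non-negative.
model = NMF(n_components=50, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)           # weights: one 50-number row per spectrum
H = model.components_                # "eigenspectra": 50 rows, 3516 wavelengths each

reconstruction = W @ H
relative_error = np.linalg.norm(V - reconstruction) / np.linalg.norm(V)
print(W.shape, H.shape, f"relative reconstruction error: {relative_error:.3f}")
```

In practice, the rows of H are what you would turn into physical filters, and each measurement then hands you a short row of W instead of a 3516-point spectrum.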
The key here is that you can select how many dimensions you want the W and H matrices to have. Fewer dimensions means the output is more data-lean, but less accurate. We ran some tests and found that you can compress it to ~50 dimensions and still get pretty good results. We consider 'pretty good' to be at least 97% correct predictions about which contaminant is in the acetone. This is promising because the above image is actually slightly old: it is from when we used a different NIST database and the spectra had 880 recorded wavelengths. The spectra in the test set we now use report 3516 wavelengths. So 50 dimensions is about 1/70th of the original data size, which is really good!
I don't have to outrun the bear
So how good is "good?" The results are actually pretty promising. Even when noise is added, the algorithm can get accurate results. Below is figure 5 from the paper.
This is promising! How well can the weights and the eigenspectra recreate the original spectra?
This is also pretty good! However, it won't always be that good. There may be added noise, or a chemical we didn't anticipate as a contaminant, or some other issue we haven't predicted. Our method needs to be robust in those situations, since we are trying to create a real, practical tool, not just a mathematical abstraction.
The good news is, the weights don't have to match up perfectly. Let's say you take light (which is like a single row of the V matrix), shine it through the filters (which are like the H matrix), and get a weight vector (which is like a single row of the W matrix). That measured weight vector doesn't have to match any stored weight precisely. It is compared to all the weights in the weight database using MSE (mean squared error). It doesn't have to be all that similar to them; in fact, it could have a very high MSE! It just has to have a lower MSE with the correct weight than with any of the other weights. It is like the two people running from the bear—they don't need to outrun the bear; they just need to outrun each other.
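Here is a sketch of that "outrun the other guy" comparison, assuming you already have a database of reference weights (one row per basis spectrum) and a measured weight vector from the detector; both are random stand-ins here:

```python
import numpy as np

rng = np.random.default_rng(1)

W_database = rng.random((640, 50))    # reference weights, one row per basis spectrum
measured_w = W_database[137] + 0.05 * rng.random(50)   # noisy measurement of row 137

# Mean squared error against every row of the weight database.
mse = np.mean((W_database - measured_w) ** 2, axis=1)
best_match = int(np.argmin(mse))

print(f"Best match: row {best_match} with MSE {mse[best_match]:.4f}")
# The absolute MSE can be large; all that matters is that the correct row's
# MSE is smaller than every other row's.
```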
Anyway, I hope this was helpful and somewhat informative! I wish research were more accessible, which is why I'm doing this. I don't think that the current situation with NIH funding is entirely due to researchers not caring about accessibility, but it certainly doesn't help. Once the paper is published, I'll publish the codebase, and y'all can look at the Python notebooks. Also, to reiterate, reviewers do not look at the code. It could be full of bugs, and nobody would know. It's not full of bugs (my professor looked at the code and confirmed it is indeed not full of bugs), but the fact that there is so little oversight for ensuring that people don't just submit garbage code is insane. I am 25 years old and wrote the code at age 23. Nobody should be taking my word that my code is not full of bugs.


