Fourier lightfield multiview stereoscope for large field-of-view 3D imaging

1. Introduction

Stereo vision has been used in surgical settings for over 100 years, at least since 1922, when Gunnar Holmgren first used a binocular operative microscope for an ear procedure.1 Since then, stereoscopes and stereo vision have been incorporated into both the practice and teaching workflows of microsurgeons in wide-ranging fields, including ophthalmology,2 neurosurgery,3 and otolaryngology,4 as well as in research settings, such as stereotaxic rodent surgery.5 Recent decades have seen the rise of digital stereoscopes alongside their analog counterparts. Digital stereoscopes, as well as related stereo-exoscopes, simultaneously capture two digital images of a surgical area from unique perspectives for a separate display to each eye of the operator.2 Heads-up stereoscopic displays provide ergonomic and teaching benefits and have enabled 3D vision in fields where analog stereoscopes cannot be easily deployed, such as minimally invasive surgery.6

Beyond the qualitative depth perception that analog and digital stereoscopes provide to human operators, quantitative 3D scene information is becoming increasingly critical in surgical settings. Automated and robotic surgical systems may rely on 3D cues for tool positioning5,7 and monitoring tissue surface deformation.8–10 Quantitative 3D feedback can also assist in the placement of handheld surgical tools. For instance, researchers performing stereotaxic rodent surgery frequently need to perform injections or place implants within specific regions of in vivo rodent brains, which are challenging and error-prone procedures.11,12 Precise and quantitative 3D feedback on the tool placement within the surgical area could improve the success rate of such procedures and increase their accessibility to labs without highly trained surgeons. However, accurately computing quantitative depth at the microscale is challenging with current digital stereoscopes, which provide just two unique views.

Numerous methods beyond stereoscopes have been developed in an attempt to provide depth information during microsurgery. First, in many procedures, sample alignment is performed prior to the start of surgery, and 3D position is tracked throughout the procedure based on the known anatomy or structure of the sample. For instance, in rodent stereotaxic surgery, the skull is mechanically aligned with a stereotaxic apparatus,13 and regions of interest can then be localized using a known atlas of the rodent brain. However, this process is time-consuming and requires a high level of expertise, leading to high failure rates.5,12 To improve stereotaxic procedures, haptic feedback from robotic systems can assist surgeons in navigating 3D space.14–16 Although these systems are precise, they can only be employed in specific confined settings and must follow pre-surgical alignment. To measure 3D tissue deformations in real time, optical coherence tomography (OCT) is utilized in microsurgery.4,17–19 OCT, either as a standalone modality or to supplement or guide digital stereoscopes, can operate with high precision and accuracy to distinguish between thin layers of tissue and visualize small vascular structures. Unfortunately, OCT typically has a limited field-of-view (FOV), forcing surgeons to frequently re-position their imaging system during surgery. The provided images may also be less intuitive for surgeons compared to conventional RGB images.20

Recently, novel computational microscopes have been developed to capture video from >2 unique perspectives to increase the accuracy of high-resolution quantitative 3D measurements. Lightfield21 and Fourier lightfield microscopes (FLFMs),22 for example, capture up to dozens of unique perspectives to robustly solve for specimen depth in the presence of various lighting conditions at each pixel. Unfortunately, such methods currently face an inherent trade-off between spatial and angular resolution, which generally results in 3D images with low pixel counts—far fewer than what is desired in microsurgical settings, where high resolution within a large FOV is the target criterion (e.g., for display on current 4K or upcoming 8K monitors).

In this work, we address these challenges with a novel camera array-based Fourier lightfield multiview stereoscope (FiLM-Scope), which uses the principles behind FLFM alongside a high-density array of compact cameras to simultaneously stream 48 high-resolution, large pixel-count viewpoints over an angular range of 21 deg × 29 deg. A unique post-processing pipeline then combines these views into a 12.8 megapixel (28×37 mm FOV, 22 μm lateral resolution) 3D video, which allows for digital zooming and panning without the need for manually moving or adjusting the scope during surgery. We demonstrate the utility of the system for microsurgical settings by tracking the 3D movement of surgical tools over the surface of an ex vivo rat skull and by capturing video-rate deformation of stretching human skin.

2. Background

Before describing the details of our new approach, we first provide additional background on related existing technology.

2.1. Depth-From-Stereo

Depth-from-stereo extracts quantitative height information from two or more images of an object. This large body of work includes both conventional (typically feature extraction and matching) and deep learning solutions.23 There has been extensive research into optimal methods for performing depth-from-stereo in surgical settings, which has been successfully demonstrated in image-guided surgery.6,24,25 Regardless, surgical stereo vision has major drawbacks. Optical features are difficult to identify and match in sparse, low-contrast regions, which is a major challenge when imaging soft tissue. Occlusions, speckles, specular reflections, and other artifacts also drastically reduce the effectiveness of feature matching between stereo image pairs.9,24

As a result, although analog and digital stereoscopes are featured in many microsurgery workflows to provide qualitative depth perception to human operators,26,27 they are rarely used to provide quantitative 3D information, as they do not typically have the precision to resolve micron-scale surgical targets, including retinal membranes, fragile brain tissue, and small blood vessels.19 Depth-from-stereo has been employed in microsurgery to estimate the 3D pose of macroscopic rigid objects—such as digital alignment of rodent skulls prior to or during surgery5,28 and tool tracking during surgery.7,29,30 However, estimating the 3D structure of deformable tissue surfaces is considerably more challenging. As an example, several prior works have attempted 3D reconstruction of retinal surfaces from stereo images, which required prior assumptions of the shape and curvature of the retina that limit generalizability and prevent tracking of tissue deformations during surgery.29,30 In this work, we demonstrate how the FiLM-Scope can accurately measure the 3D height of complex, deformable tissue surfaces while jointly tracking surgical tools with accuracy down to 11 μm, thus offering a suitable measurement system for real-time 3D guidance within microsurgical settings.

2.2. Multiview Stereo

In macroscale 3D imaging, many of the limitations of traditional binocular stereo imaging have been overcome through the use of many cameras, otherwise known as multiview stereo (MVS).31,32 Compared with binocular stereo or depth-from-stereo using a small number of cameras, MVS can provide depth information with greater accuracy and precision. The use of many cameras allows for greater angular coverage, and the redundancy of the images can overcome challenges with noise, reflection, occlusions, and texture-sparse regions.31 With the rise of self-supervised machine learning approaches, 3D reconstruction from MVS images can be performed without extensive training data or hand-crafted features.33–35 As microscopic imaging with multiple cameras in surgical settings is practically challenging, limited prior work has applied MVS within microsurgery.

2.3. Lightfield Microscopy

As noted above, FLFM can form multiperspective images from a single image sensor. FLFM, as well as similar systems published under the names extended-LFM (XLFM)36 and Fourier integral microscopy (FI-MIC),37,38 uses a cascade of a microscope objective lens, relay lenses, and a microlens array to form multiple images of a sample from different perspectives onto one digital image sensor. FLFM has been used to perform high-speed 3D imaging of microscopic samples, including human organoid dynamics,39 live cell processes,40,41 and neural activity in small organisms.36,42 However, scaling FLFM for surgical imaging over larger FOVs presents several key challenges. Most FLFM systems form many images of an object on a single sensor, so the number of available pixels must be divided between the number of images formed. This leads to unacceptably low resolution when acquiring a large number of angular perspectives on conventional sensors, while adopting new sensors with several hundred or more megapixels (MP) is not an option when video-rate imaging is required. The camera array-based light-field microscopy (CALM) system successfully demonstrated an FLFM system using multiple imaging sensors.43 However, the CALM system was not designed for quantitative video-rate 3D profiling of dynamic surfaces as is demonstrated here.

3. FiLM-Scope Hardware and Algorithm Design

3.1. Optical Design

The FiLM-Scope optical design consists of two main components: the “primary lens” (analogous to the “Fourier” lens of conventional FLFM setups22) and a compact multicamera array microscope [MCAM, Fig. 1(a)], placed together in a 4f configuration [Fig. 2(a)]. Our utilized MCAM44–47 consists of 48 microcameras, tightly packed into a 6×8 grid with 9 mm inter-camera spacing. Each microcamera includes its own custom multi-element lens and 12.8 MP CMOS sensor, all of which are triggered synchronously from a shared FPGA (S7 in the Supplementary Material). In the FiLM-Scope configuration, the 48 cameras concurrently acquire high-resolution images of our sample from 48 unique perspectives, giving a total of 600 MP per snapshot [Fig. 1(b)]. The MCAM can stream at 5 gigabytes (GB) per second, enabling video at ∼8 frames per second (FPS) at full resolution, or up to 120 FPS with 4×4 pixel binning.
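
As a sanity check on these streaming figures, the frame rate follows directly from the per-frame data volume. The sketch below assumes roughly one byte per pixel of raw data, which is a simplifying assumption on our part rather than a stated specification of the MCAM.

```python
# Rough MCAM streaming budget (assumes ~1 byte/pixel of raw data).
num_cameras = 48
pixels_per_camera = 12.8e6        # 12.8 MP sensors
bandwidth_bytes_per_s = 5e9       # ~5 GB/s aggregate streaming rate

frame_bytes_full = num_cameras * pixels_per_camera       # ~614 MB per snapshot
fps_full = bandwidth_bytes_per_s / frame_bytes_full      # ~8 FPS at full resolution

frame_bytes_binned = frame_bytes_full / 16               # 4x4 pixel binning
fps_binned = bandwidth_bytes_per_s / frame_bytes_binned  # ~130 FPS, consistent with up to 120 FPS

print(f"full resolution: {fps_full:.1f} FPS, 4x4 binned: {fps_binned:.0f} FPS")
```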

Fig. 1

FiLM-Scope configuration and example images. (a) Top: picture of multicamera array microscope (MCAM) lenses. 48× lenses are held in a single block, with 9 mm pitch. Bottom: 3D model of FiLM-Scope. (b) Snapshots from the FiLM-Scope of an ex vivo rat skull. 48 images are acquired in a single frame, allowing for video acquisition at up to 120 frames per second. The green and purple insets highlight the angular range of the system, whereas the red inset highlights the high resolution. (c) 3D height map from the FiLM-Scope. Top left: single image of the rat skull from the FiLM-Scope. Top-right: height map for the rat skull. Bottom: skull displayed as a 3D point cloud, generated using the height map. (d) Four frames from a video of a hand-held tool moving over the surface of a rat skull, acquired at 100 frames per second. 3D reconstruction was performed on all four frames, and the resulting point clouds are overlaid in the bottom frame, highlighting the motion of the tool in 3D space.

Fig. 2

FiLM-Scope calibration results. (a) Diagram of ideal parametric FiLM-Scope system, demonstrating how 3D information is encoded in the images. Red points lie on object plane and appear at the same pixel location in both highlighted elemental images. Blue points are axially translated off the object plane. On the center elemental image, the blue points project to the same image location as the red points (forming purple points). In the upper elemental image, the projections of the blue points are laterally translated relative to the projections of the red points. (b) Example FiLM-Scope calibration results, showing shift slopes (Si) for all 48 cameras (all units: pixels/mm). Top: average shift slope value for each camera. Bottom (red and blue insets): Varying Si(p) values for two cameras across the image fields-of-view. (c) Top: on-axis, diffraction-limited axial resolution for each camera with a logarithmic scale bar. Bottom: on-axis, diffraction limited lateral resolution for each camera. (d) Digital refocusing results with raw reference camera image (top) and digital refocus back-projections of all 48 images to the respective axial planes (0, 6.4, 8.0, and 10.7 mm) using data from a snapshot in Fig. 1(b). A full digitally refocused volume is shown in Video 1 (Video 1, Mp4, 19.4 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s1]).

To utilize all 48 cameras, the primary lens must have an aperture larger than the diameter of the MCAM (90 mm). In addition, the f-number of the lens (the ratio between its focal length and diameter) should be as small as possible to improve lateral and axial resolution. When designing the FiLM-Scope, we used the definition for axial resolution detailed by Guo et al.22 and Galdón et al.,48 which provides a resolution metric for the optical system, independent of the reconstruction algorithm. Depending on the sample type and reconstruction algorithm, the true axial precision can fall above or below this value. For a given microcamera i, the axial resolution (ϕax,i) is the smallest axial separation at which two laterally aligned points are separately resolved in camera i. In our setup, this is given as

Eq. (1)

$\phi_{ax,i} = \frac{\phi_{lat,i}}{\tan\theta_i},$

where ϕlat,i is the lateral resolution in camera i and θi is the angle between the optical axis and the chief ray for camera i. The axial resolution of the system is then reported as the finest resolution among all the cameras [Fig. 2(c)]:

Eq. (2)

$\phi_{ax} = \frac{\phi_{lat}}{\tan\theta_{max}} \approx \frac{\phi_{lat}}{\delta_{max}/F},$

where δmax is the distance between the optical axis and the array lens furthest from the axis and F is the focal length of the primary lens (see Table 2, and S1 in the Supplementary Material). In reality, the FiLM-Scope can provide far more precise 3D results than this value would suggest, depending on the sample and the algorithm used in 3D reconstruction. Regardless, this provides a useful benchmark for assessing the system’s optical design.
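
As a worked example of Eq. (2), the snippet below plugs in the design values from Tables 1 and 2 (δmax = 38.7 mm, F = 100 mm, and the theoretical diffraction-limited lateral resolution of 9.30 μm) to reproduce the 24 μm theoretical axial resolution reported in Table 2.

```python
# Eq. (2): system axial resolution from the lateral resolution and the
# chief-ray angle of the outermost camera.
F = 100.0          # primary lens focal length (mm), Table 1
delta_max = 38.7   # distance from optical axis to furthest array lens (mm), Table 1
phi_lat = 9.30     # theoretical diffraction-limited lateral resolution (um), Table 2

tan_theta_max = delta_max / F          # ~0.387
phi_ax = phi_lat / tan_theta_max       # ~24 um, matching Table 2

print(f"theoretical diffraction-limited axial resolution: {phi_ax:.1f} um")
```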

To ensure our system has an adequate axial resolution, we chose the Zeiss R-Biotar 100 mm f/0.73 lens as our primary lens, which was originally designed to image an X-ray system’s scintillator response onto a large format film. Because this lens is not fully infinity corrected, the FiLM-Scope suffers from spherical and coma aberrations, resulting in a significant resolution fall-off toward the edges of all the images, as well as throughout the FOV for cameras on the edge of the array [Fig. 2(c), and S1 in the Supplementary Material]. In addition, each of the 48 images has a unique distortion pattern, which is compensated for in our calibration procedure. In future iterations of this system, a custom-designed primary lens can further improve overall system performance. The specifications for all lenses are given in Table 1.

Because 3D information encoding in FLFM relies on the lens configuration rather than illumination patterns, the FiLM-Scope can be employed with a range of illumination setups. Images shown in this work were acquired with one of two illumination setups: exterior epi-illumination with a flexible LED strip affixed to the front of the lens, or internal epi-illumination from directly behind the primary lens via a large ring illuminator, both of which are detailed in S8 in the Supplementary Material.

Table 1

System specifications. The array lenses are custom-designed lenses from Edmund Optics, designed to fit in the tight 9 mm pitch of the MCAM sensor board. The lenses and sensors are packed into a 6×8 grid, which determines DMCAM and δmax. The primary lens is the Zeiss R-Biotar f/0.73 lens, originally designed for use in an X-ray system.

| Name | Symbol | Value |
| --- | --- | --- |
| Primary lens focal length | F | 100 mm |
| Primary lens diameter | D | 100 mm / 0.73 = 137 mm |
| Array lens focal length | f | 14.64 mm |
| Array lens diameter | d | 5.7 mm |
| Pixel pitch | ρ | 1.1 μm |
| MCAM lens spacing | p | 9 mm |
| Distance from optical axis to furthest lens | δmax | √(((6−1)·p)² + ((8−1)·p)²)/2 = 38.7 mm |
| MCAM diameter | DMCAM | √((6·p)² + (8·p)²) = 90 mm |

3.2. Calibration

In Fourier lightfield imaging, each array lens receives rays emitted from the object from a unique range of angles. As a result, when an object shifts axially away from the focal plane (Δz), its projection in each image shifts laterally [Fig. 2(a)]. In an ideal, paraxial system, the magnitude of that shift in pixels will be given by

Eq. (3)

$\Delta p_i = \frac{M\tan\theta_i}{\rho}\,\Delta z,$

where M is the system magnification, θi is the angle between the chief ray entering lens i and the optical axis, and ρ is the pixel pitch. Because points at the object plane should appear at identical pixel locations in every elemental image (an image formed by a single camera in the array), we can calculate the distance of a point from the object plane by finding the disparity between its projection in any two elemental images. Specifically, if a point appears at pixel pi in image i and pj in image j, we can find its axial location z with the following equation:

Eq. (4)

$z = \frac{\rho(p_i - p_j)}{M(\tan\theta_i - \tan\theta_j)}.$

This concept forms the basis of projection between image and object space in FLFM, which is utilized in nearly all 3D reconstruction approaches.
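
To make Eq. (4) concrete, the sketch below evaluates it for a hypothetical pair of elemental images; the magnification, pixel pitch, disparity, and view angles are illustrative placeholders rather than calibrated FiLM-Scope values.

```python
import math

def depth_from_disparity(p_i, p_j, theta_i_deg, theta_j_deg, M, rho_mm):
    """Eq. (4): axial position z (mm) of a point from the disparity (in pixels)
    between its projections in two elemental images i and j."""
    disparity_px = p_i - p_j
    dtan = math.tan(math.radians(theta_i_deg)) - math.tan(math.radians(theta_j_deg))
    return rho_mm * disparity_px / (M * dtan)

# Illustrative values only (not calibrated system parameters).
z = depth_from_disparity(p_i=1250.0, p_j=1130.0,
                         theta_i_deg=10.0, theta_j_deg=-10.0,
                         M=0.12, rho_mm=1.1e-3)
print(f"estimated axial position: {z:.2f} mm")
```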

Due to the distortions that arise from using a large primary lens and small misalignment errors, our system differs from the ideal system in several key ways. First, both M and θ may vary within a single elemental image. Second, we cannot assume that samples at the object plane will project identically to each elemental image. Thus, we need a calibration model that can capture these deviations. Due to the black box nature of our primary lens, we chose to use an experimental calibration scheme.

In our calibration model, each camera i is described by two pairs of equations. The first pair, Si,x(p) and Si,y(p), describes the ratio between the axial shift of a point in object space (Δz, given in millimeters) and the lateral shift of its projection in image space (Δpi, given in pixels), which we will refer to as the “shift ratio.” The second pair, Oi,x(p) and Oi,y(p), describes the shift in pixels between where a point on the object plane appears in image i and where it appears in a chosen reference perspective, which we will call the “inter-camera shifts.” All four equations are parameterized by the coefficients for a second order, 2D polynomial equation, where the input is a 2D pixel location, p=(px,py). Thus, our full calibration model consists of 48×2×2 polynomial equations, which together can fully describe the forward projection from object space into the image space.

Si,x(p) and Si,y(p) are calibrated separately for shifts in the x and y dimensions, without explicitly estimating M or θ [Fig. 2(b)]. Specifically, for a given pixel location pi in camera i

Eq. (5)

$S_{i,x}(p_i) = \Delta p_{i,x}/\Delta z,$

Eq. (6)

$S_{i,y}(p_i) = \Delta p_{i,y}/\Delta z.$

We experimentally calibrate Si(p) for each camera by imaging a graph target at five or more distinct axial planes, measuring how far each graph vertex shifts between images, and fitting a line to those shifts. We then fit the coefficients of Si(p) to reproduce the slopes of those lines across the image FOV. We similarly fit Oi(p) by imaging the graph target at the object plane and then measuring the shift in pixels between the locations of the graph vertices in image i and the locations of the same vertices in the chosen reference image. More information is given in Appendix B.

After calibration, the parametrized model of our system can be applied to a wide range of 3D reconstruction algorithms. As a first example, we can generate a digital z-stack by back-projecting all of the images onto a chosen z plane, which brings into focus all points lying at the selected plane and blurs all other regions, akin to the “shift-and-sum” or “digital refocusing” operation in traditional lightfield imaging49,50 [Fig. 2(d), Video 1]. This z-stack can be used to qualitatively visualize 3D structures in the imaged object.
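
A minimal sketch of this shift-and-sum refocusing is shown below. For simplicity, it assumes each view is shifted by a single integer pixel offset proportional to the chosen z plane, whereas the real pipeline uses the spatially varying calibration model described above.

```python
import numpy as np

def refocus(images, shift_slopes, z):
    """Digitally refocus to plane z (mm) by shifting and summing elemental images.

    images: list of HxW arrays, one per camera.
    shift_slopes: list of (sx, sy) per camera in pixels/mm, treated here as
                  constant within each view for simplicity.
    """
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (sx, sy) in zip(images, shift_slopes):
        dx, dy = int(round(sx * z)), int(round(sy * z))
        # np.roll wraps at the borders; a real implementation would pad or crop.
        acc += np.roll(img, shift=(dy, dx), axis=(0, 1))
    return acc / len(images)

# Toy usage with random arrays standing in for elemental images.
views = [np.random.rand(64, 64) for _ in range(4)]
slopes = [(5.0, 0.0), (-5.0, 0.0), (0.0, 5.0), (0.0, -5.0)]
plane = refocus(views, slopes, z=1.5)
```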

3.3. 3D Reconstruction Algorithm

To obtain dense, quantitative height information, we adopted an approach similar to prior algorithms such as MVS-Net, which was designed for MVS 3D reconstruction of macroscale objects.33–35,51 Our unsupervised optimization algorithm is outlined in Fig. 3. First, all 48 multiperspective images are back-projected through the expected depth range of the sample to create a digital z-stack. The z-stack is oriented to the perspective of a single reference camera, which we choose to be one of the four center cameras in the array. We will refer to this perspective as the “reference view.” The z-stack is then fed through a 3D U-Net, which outputs a probability volume of the same size, describing the likelihood of the sample’s surface passing through each voxel in the volume. For each x,y location in the volume, the values along the z axis sum to 1, so we can extract a depth map from the volume by multiplying each plane in the volume by its corresponding height and summing all the planes (see Appendix D for more details).

Fig. 3

Diagram for self-supervised 3D reconstruction algorithm developed for the FiLM-Scope. Images are first back-projected into a 3D volume, using our custom calibration approach. The volume is then fed through a 3D U-Net, outputting a probability volume from which we can extract a height map. All images are rectified to a reference viewpoint using this height map, and the loss between each of the rectified images and the “reference image” (acquired from the reference camera) is used to update weights on the U-Net.

A loss is computed by using the depth map to rectify all 48 images to the reference view and then taking the structural similarity between each rectified image and the original image acquired from the reference camera. This loss is used to update the weights on the U-Net. The same z-stack is then iteratively fed through the U-Net, with the weights updated at each iteration, until the algorithm converges. The final depth map can then be Gaussian or median filtered to further reduce noise.

4. Results

4.1. Optical Specifications

The FiLM-Scope achieves a FOV of up to 28.2×37.1 mm, with an on-axis lateral resolution and depth-of-field (DOF) for the central camera of 21.6 μm and 3.1 mm, respectively, measured by taking the line-spread-function across a USAF resolution target (S1.2 in the Supplementary Material). The system does experience aberrations toward the edges of the FOV, which are described in S1.5 in the Supplementary Material. Using the definition of axial resolution provided in Guo et al.,22 our optical design has an on-axis diffraction-limited axial resolution of 82.1 μm [Fig. 2(c)]. Although Fourier lightfield imaging does provide generally consistent 3D precision within the imaging volume,22 the optics-dependent axial resolution is proportional to the lateral resolution [Eq. (1)], so the axial resolution will vary with the system aberrations across the FOV and within the DOF (S6 in the Supplementary Material). Because the true system precision is also dependent on the sample type and reconstruction algorithm, we can produce 3D results with greater precision than this theoretical value. Our full reconstruction pipeline has yielded results with standard error as low as 11 μm (Figs. 4 and 5). The governing equations and theoretical and measured specifications are summarized in Table 2, and the process for calculating these values is detailed in Appendix A and S1 in the Supplementary Material.

Fig. 4

Select results. (a) Images and height maps for an ex vivo rat skull. For each of the four crops, the left panel shows the grayscale image, and the right panel shows an overlay of the reconstructed height map and the grayscale image. The FiLM-Scope algorithm detects small height changes on the skull surface, down to fractions of a millimeter. (b)–(c) Grayscale images and height maps of human fingers. Insets highlight details of the reconstructed height map. In the bottom inset for each image, the global tilt in the height map is removed to clearly highlight 3D spatial resolution.

Fig. 5

FiLM-Scope video acquisition. (a) 3D tracking of points on a tool, as a human user moves the tool over the surface of an ex vivo rat skull. Top: select frames with marked points. Bottom: graph of estimated heights for each point during 8 s of acquired video, captured at 100 FPS, along with the angle of the tool, measured between two points on its surface. The full video is provided in Video 2 (Video 2, Mp4, 16.4 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s2]). (b) Estimated height of tool translated axially on an automated stage away from an ex vivo rat skull. Our algorithm accurately estimated the height of the tool with 11  μm standard error. (c) Select frames from a video and corresponding height maps of human skin. Video was acquired with 4×4  pixel binning at 100 FPS. In the right-most insets, global tilt was removed from the height maps to highlight fine features. The full video is shown in Video 3 (Video 3, Mp4, 38.2 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s3]).

Table 2

Major system performance metrics are summarized here. All metrics except angular range can be reported separately for each camera. Here, values for magnification, lateral resolution, and depth of field are reported for the center camera, along with the mean and standard deviation across all cameras. Axial resolution is reported for the camera with the smallest value, as that determines the axial resolution of the system as a whole. A wavelength of λ=530 nm was used in the relevant calculations. More details are given in S1 in the Supplementary Material.

| Metric | Symbol | Equation | Theoretical | Center/System | Mean | Std. Dev. |
| --- | --- | --- | --- | --- | --- | --- |
| Magnification | M | f/F | 0.146 | 0.1215 | 0.1212 | 0.0004 |
| Numerical aperture | NA | tan⁻¹(d/2F) | 0.028 | | | |
| Diff. lateral resolution | ϕlat,diff | λ/(2·NA) | 9.30 μm | 21.63 μm | 23.43 μm | 2.57 μm |
| Geo. lateral resolution | ϕlat,geo | 2ρ/M | 15.03 μm | 18.10 μm | 18.15 μm | 0.05 μm |
| Diff. axial resolution | ϕax,diff | ϕlat,diff/(δmax/F) | 24.0 μm | 82.1 μm | | |
| Geo. axial resolution | ϕax,geo | ϕlat,geo/(δmax/F) | 38.8 μm | 53.5 μm | | |
| Depth of field | DOF | √(n²−NA²)/NA²·λ | 652 μm | 3.08 mm | 3.13 mm | 0.05 mm |
| Angular range | | 2·tan⁻¹(δ/F) | (25.4 deg, 34.0 deg) | (21.4 deg, 29.4 deg) | | |

4.2. Height Maps

Applying our 3D reconstruction algorithm (Fig. 3) to snapshots acquired with the FiLM-Scope, we can distinguish the heights of fine features within the DOF of the system while also reconstructing large structures well outside the DOF, up to 1 cm. Example height maps are shown in Fig. 4. Figure 4(a) shows small height changes on the surface of a rat skull. In these maps, global tilt was removed to better show the precision of the results. The height map for a full rat skull is shown in Fig. 1.

Figures 4(b) and 4(c) show height maps for human fingers. From these results, we can see both the curvature of the finger over a depth of about 1 cm and the heights of small cracks and height changes on the skin surface. Additional reconstruction results with ground truth samples are included in S6 and S9 in the Supplementary Material, and a comparison against OCT data is shown in S5 in the Supplementary Material.

4.3. 3D Video Acquisition and Reconstruction

The FiLM-Scope acquires all information needed for 3D reconstruction in a single snapshot, so we can acquire 3D video at the full frame rate of the MCAM. Currently, the streaming rate of the MCAM is ∼5  GB/sec, which can be flexibly allocated between the number of cameras used, the streaming pixels per camera, and the system frame rate. For example, we can stream from all 48 cameras at full resolution at ∼8  FPS, or from 48 cameras with 4×4  pixel down-sampling at ∼120 FPS. The frame rate can be further increased by streaming from only a subset of the 48 cameras. After acquisition, we perform 3D reconstruction on the first video frame using the previously described algorithm (Sec. 3.3). We then sequentially reconstruct each subsequent frame by fine-tuning the network from the previous frame. This approach drastically speeds up reconstruction time compared with starting from a randomly initialized U-Net for each frame.

To demonstrate the utility of our approach for 3D visualization of tissue surfaces and tool movement during microsurgery, we used the FiLM-Scope to capture 3D video of tools moving over the surface of an ex vivo rat skull. We first acquired a video at 100 FPS with 4×4  pixel down-sampling while a human user freely moved the tool in 3D space and tapped the surface of the skull. After performing 3D reconstruction, we manually selected five points along the tool in the first frame from the reference camera and used optical flow to track these points throughout the video. We then extracted the heights of these points from the corresponding locations in the height map. Plotting these results [Fig. 5(a)], we can see the variation in the height and angle of the tool as the user lowers the tool to touch the skull. The 3D video from this experiment is provided in Video 2.
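
The text does not specify which optical-flow implementation was used to follow the selected points, so the sketch below shows one common option, pyramidal Lucas-Kanade tracking in OpenCV, applied to grayscale frames from the reference camera; the window size and pyramid depth are illustrative choices.

```python
import cv2
import numpy as np

def track_points(frames, initial_points):
    """Track hand-selected (x, y) points through a list of grayscale uint8 frames
    with pyramidal Lucas-Kanade optical flow; returns per-frame point locations."""
    pts = np.asarray(initial_points, dtype=np.float32).reshape(-1, 1, 2)
    tracks = [pts.reshape(-1, 2).copy()]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev, nxt, pts, None, winSize=(21, 21), maxLevel=3)
        tracks.append(pts.reshape(-1, 2).copy())
    return tracks

# Toy usage; the tracked locations would then index the per-frame height maps.
frames = [np.random.randint(0, 255, (480, 640), dtype=np.uint8) for _ in range(5)]
tracks = track_points(frames, [(100.0, 200.0), (320.0, 150.0)])
```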

Next, we mounted the tool on an electronic stage to move it to known locations along the optical axis of the system. We acquired a full-resolution snapshot at each stage height and performed 3D reconstruction on all frames. We then selected a single point on the tool and extracted the height at that point in all frames, averaging over a 100×100 pixel region (corresponding to a 0.9×0.9 mm region of the tool). The plot of our estimated heights is shown in Fig. 5(b). We accurately estimated the axial movement of the tool, with a mean absolute error of 11.13 μm. When coupled with our system’s 11 μm half-pitch lateral spatial resolution, this indicates a microscope with nearly isotropic 3D measurements at the resolutions required in microsurgical scenarios.

Our approach can additionally be used to monitor deformation in tissue. In Fig. 5(c), we show select frames from a video of human skin prodded with a wooden tool to cause stretching and bunching, acquired at 100 FPS with 4×4  pixel downsampling. We were able to visualize fine features in the skin and skin-tool interactions and detect small height changes between frames. The 3D video for this experiment is shown in Video 3.

5. Discussion and Conclusion

In this work, we introduce the FiLM-Scope, a novel system for acquiring 48× 12.8 MP multiperspective images of a surgical surface in a single snapshot with a single optical system. The FiLM-Scope captures high-resolution images (down to 22  μm) over a large FOV (28.2×37.1  mm), allowing for digital zooming and panning to eventually eliminate the need for surgeons to manually adjust the placement of their surgical scopes. We also propose a self-supervised approach for generating dense height maps from these multiperspective images and show the utility of our approach by monitoring the movement of surgical tools over an ex vivo rat skull and visualizing 3D deformation of human skin.

Many potential samples of interest do not require all 48 images for successful 3D reconstruction. The number of cameras needed is both sample dependent (complex or dense objects may require more cameras) and task dependent (some tasks require finer resolutions or higher frame rates). Thus, in future versions of the FiLM-Scope, a smaller number of cameras could be used to decrease the system size and weight. Alternatively, sets of cameras can be used for different functions. We could focus subsets of cameras to distinct focal planes to improve the overall DOF of the system or use multimodal filters over some cameras for simultaneous, multimodal 3D image acquisition. Initial tests with varied numbers of cameras used in 3D reconstruction are shown in S9 in the Supplementary Material. Based on these results, we expect that a future FiLM-Scope system using only 4×4 cameras can provide a compact, light-weight system that still delivers high-quality depth information.

In addition to the size and weight, a current limitation of the FiLM-Scope is the time required for 3D reconstruction. Although images can be acquired at up to 120 FPS, reconstruction can take several minutes. This is, of course, not compatible with providing live visualization to surgeons. Fortunately, significant work is being performed to increase the speed and efficiency of MVS algorithms,52–55 and these changes can be incorporated into our algorithm in the future. In addition, with the collection of a larger dataset, we can improve training and inference to eventually enable live reconstruction.

In conclusion, the FiLM-Scope demonstrates how increasing the number of cameras used in stereoscopes can yield detailed and complete 3D results, with great potential for improving surgical setups, among other applications. We expect future iterations of this system to improve 3D visualizations in a wide range of settings requiring high-resolution 3D imaging over large fields of view.

6. Appendix A: Optical Design

As described above, the FiLM-Scope has two main optical components: the primary lens and a multicamera array in a 4-f configuration. As detailed in Galdón et al.,48 the key parameters in determining theoretical performance are the focal lengths of the primary and array lenses, the aperture size of the array lenses, and the full diameter of the lens array. To maximize axial resolution, the aperture of the primary lens must be at least as large as the diameter of the lens array, but increasing its aperture size beyond that will not impact performance (Table 2).

Our system uses the MCAM, which was originally designed to capture high-resolution 2D images and video over a large FOV44,45 and has since been used in several architectures to achieve 3D visualization.46,47 The MCAM hardware has been highly optimized to ensure the microcameras are compact, synchronized (S7 in the Supplementary Material), and can stream at high frame rates. As such, this is an ideal choice for the FiLM-Scope setup.

For the primary lens, we used the Zeiss R-Biotar 100 mm f/0.73 lens, which was originally designed to image an X-ray system’s scintillator response onto a large format film. This lens has a large aperture size, 137 mm, which can fully cover the MCAM (max diameter 90 mm), along with a small f-number to improve magnification, resolution, and angular disparity of the acquired images (see Table 2 and S1 in the Supplementary Material). In general, Fourier lightfield imaging requires a primary lens, which is infinity-corrected, and our chosen lens was designed for finite conjugate imaging. However, because it was designed to image a fluorescent response from a distance significantly larger than its focal length onto film immediately adjacent to its back glass surface, we can use the lens backward (i.e., placing our sample where the film previously sat) and approximate the lens as infinity corrected.

7. Appendix B: System Calibration

The calibration of a 3D imaging system is critical in determining both the accuracy and precision of results, and the complexity or simplicity of the calibration scheme will impact reconstruction time for many algorithms. Thus, we need a model that captures the details of our system while still allowing for rapid forward and backward projection during reconstruction. Because our system includes a large, black-box primary lens that can introduce significant distortions, we chose to use an experimental, ray-based model that does not assume spatial invariance within each microcamera.

As detailed in Sec. 3.2, each of our 48 cameras is described by two pairs of second-order, 2D polynomial equations (giving a total of 48×2×2 equations for the full system). The first pair of equations, Si,x(p) and Si,y(p), describes the “shift ratio” for camera i (the ratio of the shift in pixels experienced by the projection of an object in camera i to the axial shift in millimeters experienced by the object, or Δp/Δz) in the x and y dimensions. In an ideal system, the shift ratio for camera i would be given by

Eq. (7)

$S_i = M\cdot\tan\theta_i\cdot\frac{1}{\rho} = \frac{f}{F}\cdot\frac{d_i}{F}\cdot\frac{1}{\rho} = \frac{f\cdot d_i}{F^2}\cdot\frac{1}{\rho},$

where f is the focal length of the array lenses, F is the focal length of the primary lens, θi is the angle of the central ray entering array lens i relative to the optical axis [Fig. 2(a)], and di is the distance from the optical axis to array lens i. To account for nontelecentricity, distortions, and misalignment which may cause both M and θi to vary within an elemental image, we calibrate for Si without explicitly finding M or θi and define an expression to give its value at each pixel location within an elemental image.

The second pair of equations, Oi,x(p) and Oi,y(p), gives the “inter-camera shifts” (the shift in pixels from the pixel location p where a point on the object plane appears in camera i to where it appears in the reference camera) in the x and y dimensions. In an ideal, paraxial system, the inter-camera shifts would equal 0 for all cameras; however, we calibrate for this value to allow for nonidealities or small misalignments in the system.

The end goal of our calibration is to find nine coefficients for each polynomial equation. Specifically, we want to find all a and b values for the following four equations for each camera i:

Eq. (8)

$S_{i,x}(p) = a_{i,x,0} + a_{i,x,1}x + a_{i,x,2}y + \ldots + a_{i,x,9}x^2y^2,$

Eq. (9)

$S_{i,y}(p) = a_{i,y,0} + a_{i,y,1}x + a_{i,y,2}y + \ldots + a_{i,y,9}x^2y^2,$

Eq. (10)

$O_{i,x}(p) = b_{i,x,0} + b_{i,x,1}x + b_{i,x,2}y + \ldots + b_{i,x,9}x^2y^2,$

Eq. (11)

$O_{i,y}(p) = b_{i,y,0} + b_{i,y,1}x + b_{i,y,2}y + \ldots + b_{i,y,9}x^2y^2.$

To perform the calibration, we capture a set of images of a graph target at five or more axial planes, separated by 1 mm. We then extract the locations of all the graph vertices in each of the 48 images, at each plane. For a single camera, we can match the pixel locations of each vertex between all five images, and fit a line to the locations (with a slope in pixels/mm). Then, by taking these slopes across the full image, we can use a least squares regression to find coefficients for the polynomial equation, which will take a pixel location p=(px,py) as input, and output the shift ratio Si(p) at that point. This is repeated for all 48 cameras.
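
A sketch of this least-squares step is given below; it fits the coefficients of a full second-order 2D polynomial to per-vertex shift slopes, with synthetic vertex locations and slopes standing in for the measured graph-target data.

```python
import numpy as np

def poly2_design(px, py):
    """Design matrix for a full second-order 2D polynomial."""
    return np.stack([np.ones_like(px), px, py, px * py,
                     px**2, py**2, px**2 * py, px * py**2, px**2 * py**2], axis=1)

# Synthetic vertex locations (pixels) and per-vertex shift slopes (pixels/mm).
rng = np.random.default_rng(0)
px = rng.uniform(0, 4000, 200)
py = rng.uniform(0, 3000, 200)
slopes = 50.0 + 2e-3 * px - 1e-3 * py + rng.normal(0, 0.1, 200)  # toy data

A = poly2_design(px, py)
coeffs, *_ = np.linalg.lstsq(A, slopes, rcond=None)

def S(p):
    """Evaluate the fitted shift-slope polynomial at a pixel location p = (px, py)."""
    return (poly2_design(np.atleast_1d(p[0]), np.atleast_1d(p[1])) @ coeffs)[0]

print(S((2000.0, 1500.0)))
```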

Next, we select the object plane and a reference camera. For each nonreference camera, we match all the vertex locations from the target image captured at the object plane, to the corresponding vertex locations in the reference camera’s image. We can then fit polynomial equations Oi,x(p) and Oi,y(p) for each camera i to give the shift in pixels between the target image and reference image at a given pixel location. Taken together, these sets of polynomial equations provide a complete mapping from object to image space, and vice versa. A flowchart for the full calibration procedure is given in S2 in the Supplementary Material.

8. Appendix C: Height Calculation and Projection

The axial location of a point in space can be calculated from this calibration model using its pixel location in any two cameras. If a point appears at pixel location pk=(pk,x,pk,y) in camera k and pn=(pn,x,pn,y) in camera n, then the height z can be given by either its x or y pixel locations

Eq. (12)

$z = \frac{(p_{k,x} - p_{n,x}) + [O_{k,x}(p_k) - O_{n,x}(p_n)]}{S_{k,x}(p_k) - S_{n,x}(p_n)},$

or

Eq. (13)

$z = \frac{(p_{k,y} - p_{n,y}) + [O_{k,y}(p_k) - O_{n,y}(p_n)]}{S_{k,y}(p_k) - S_{n,y}(p_n)}.$

From there, if the magnification in camera k is Mk and the pixel pitch is ρ, we can calculate the lateral location of the point with

Eq. (14)

$x = [p_{k,x} - S_{k,x}(p_k)\cdot z + O_{k,x}(p_k)]\cdot\frac{\rho}{M_k},$

Eq. (15)

$y = [p_{k,y} - S_{k,y}(p_k)\cdot z + O_{k,y}(p_k)]\cdot\frac{\rho}{M_k},$

or equivalently using camera n in place of k.
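
The sketch below evaluates Eqs. (12) and (14) directly, assuming the calibration polynomials S and O are available as callables; the constant values used for them here are illustrative stand-ins, since the real polynomials vary across the image.

```python
def height_from_pair(pk, pn, Sk_x, Sn_x, Ok_x, On_x):
    """Eq. (12): height z (mm) from the x pixel coordinates of a point seen at
    pk in camera k and pn in camera n."""
    num = (pk[0] - pn[0]) + (Ok_x(pk) - On_x(pn))
    den = Sk_x(pk) - Sn_x(pn)
    return num / den

def lateral_x(pk, z, Sk_x, Ok_x, rho_mm, Mk):
    """Eq. (14): lateral x position (mm) of the same point, using camera k."""
    return (pk[0] - Sk_x(pk) * z + Ok_x(pk)) * rho_mm / Mk

# Illustrative constant-valued calibration functions (real ones vary with p).
Sk_x = lambda p: 50.0
Sn_x = lambda p: -48.0
Ok_x = lambda p: 2.0
On_x = lambda p: -1.5

pk, pn = (1250.0, 900.0), (1010.0, 905.0)
z = height_from_pair(pk, pn, Sk_x, Sn_x, Ok_x, On_x)
x = lateral_x(pk, z, Sk_x, Ok_x, rho_mm=1.1e-3, Mk=0.12)
print(f"z = {z:.3f} mm, x = {x:.3f} mm")
```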

We will now give a brief explanation of how projection can be performed using this calibration model, which is essential for our 3D reconstruction algorithm. Once a height is assigned to each pixel value, backward projection can be performed by shifting all pixels in each image to the object plane, according to the calculated shift ratio Si and the assigned height. Then, the pixels can be shifted again to the reference frame. Given a pixel location pi in image i with assigned height hp:

Eq. (16)

$p_i' = p_i + S_i(p_i)\cdot h_p,$

Eq. (17)

$p_{ref}' = p_i' + O_i(p_i'),$

Eq. (18)

$p_{ref} = p_{ref}' - S_{ref}(p_{ref}')\cdot h_p.$

In other words, if a point that projects to location pi in image i were translated axially to the object plane, it would project to pixel location pi′. That same point on the object plane would project to location pref′ in the reference image, and the point at its actual height hp would project to pref in the reference image. Backward projection can then be performed by moving the value at every pixel location pi to its corresponding location in the reference coordinate frame, pref.

We perform this operation using Pytorch’s torch.nn.functional.grid_sample function to efficiently warp the images in a pixel-wise fashion. This function takes two inputs: (1) the image Ii to be warped and (2) an H×W×2 volume, where H and W are the height and width of image Ii in pixels, respectively. Each index in this volume must contain the normalized (px,py) pixel coordinates of the value in Ii, which should be warped to that index. Thus, to perform forward projection, we must start with a pixel location pref in the reference frame and specify the location pi in image Ii that warps to that pixel location, rather than starting with pi in image Ii and using Eqs. (16)–(18) to find pref. This requires an equation of the form pi=pref+D(hp,pref), where D is some function that can be efficiently computed without prior knowledge of pi. To accomplish this, prior to the start of reconstruction, we generate a dense map S^i, where S^i(pref′)=Si(pi′), and O^i, where O^i(pref′)=Oi(pi′) (more details are given in S2 in the Supplementary Material). With the assumption that O^i(pref′)≈O^i(pref) and S^i(pref′)≈S^i(pref), our full backward projection for a given pixel can then be given as

Eq. (19)

$p_i = p_{ref} + S_{ref}(p_{ref})\cdot h_{ref}(p_{ref}) - \hat{O}_i(p_{ref}) - \hat{S}_i(p_{ref})\cdot h_{ref}(p_{ref}),$

where href is a height map in the reference point of view. To backproject image Ii, the value at location pi=(pi,x,pi,y) is moved to location pref=(pref,x,pref,y), for every x in {1:H} and y in {1:W}. We will notate this as

where I^i is image Ii backprojected to the reference frame using a height map href.
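
Below is a minimal sketch of this grid_sample-based warp B(Ii, href), assuming the dense calibration maps Sref, Ŝi, and Ôi have already been evaluated on the reference pixel grid; random tensors stand in for the image, height map, and calibration maps, and the coordinate normalization convention shown is one reasonable choice rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def backproject(I_i, h_ref, S_ref, S_hat_i, O_hat_i):
    """Warp image I_i (1, 1, H, W) into the reference frame, following Eq. (19).

    h_ref: (H, W) height map in the reference view (mm).
    S_ref, S_hat_i, O_hat_i: (H, W, 2) dense maps of the x/y shift ratios
    (pixels/mm) and inter-camera offsets (pixels), indexed by reference pixel.
    """
    _, _, H, W = I_i.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    p_ref = torch.stack([xs, ys], dim=-1)                 # (H, W, 2), (x, y) order

    # Eq. (19): source pixel in image i for every reference pixel.
    h = h_ref.unsqueeze(-1)                               # (H, W, 1)
    p_i = p_ref + S_ref * h - O_hat_i - S_hat_i * h       # (H, W, 2)

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * p_i[..., 0] / (W - 1) - 1,
                        2 * p_i[..., 1] / (H - 1) - 1], dim=-1)
    return F.grid_sample(I_i, grid.unsqueeze(0), mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# Toy usage with random stand-ins.
I_i = torch.rand(1, 1, 128, 128)
h_ref = torch.zeros(128, 128)
S_ref = torch.full((128, 128, 2), 10.0)
S_hat_i = torch.full((128, 128, 2), -8.0)
O_hat_i = torch.full((128, 128, 2), 1.5)
I_hat = backproject(I_i, h_ref, S_ref, S_hat_i, O_hat_i)
```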

Although forward projection is not needed for the 3D reconstruction algorithm used in this work, it could be done by performing the back projection step in reverse. Starting with an image and height map in the reference frame, we can generate 48 images by shifting the pixels according to the calculated inter-camera pixel shifts, then shifting the pixels in the 48 images again according to their height and the shift ratio at that point. Concretely,

Eq. (21)

$p_{ref}' = p_{ref} + S_{ref}(p_{ref})\cdot h_p,$

Eq. (22)

$p_i' = p_{ref}' - \hat{O}_i(p_{ref}'),$

or, using S′ and O′:

Eq. (24)

$p_i = p_{ref} + h_p[S_{ref}(p_{ref}) - S_i'(p_{ref})] - O_i'(p_{ref}).$

9. Appendix D: 3D Reconstruction

Although our calibration model can be applied to a wide range of 3D reconstruction algorithms, for this paper, we used a self-supervised approach adapted from MVS-Net, a previously published network for reconstructing height from MVS images.33–35,51 Although they occur at a different spatial scale (typical MVS datasets contain macroscale objects, such as buildings or furniture), the FiLM-Scope images share many similarities with these datasets. In both cases, an image set consists of dozens of images of a mostly opaque object, captured by calibrated cameras viewing from different angular perspectives within a limited solid cone. Thus, self-supervised networks designed for MVS images adapted well to our use case.

MVS-Net begins by using a single, shared convolutional neural network (CNN) to create multiple feature maps for each image and then back-projects these maps through 3D space to form multiple feature volumes. These volumes are fed together into a 3D U-Net, which outputs a probability volume from which a height map can be extracted. The weights on both the shared CNN and the 3D U-Net are updated during training. The loss used varies between implementations, but it generally consists of the difference between an acquired image from a chosen perspective and another image warped to that perspective using the height map. These algorithms rely on the system calibration both to form the back-projected volume and to warp images between perspectives when computing the loss. We can use our custom calibration model to perform these steps.

In addition to our distinct calibration approach, our algorithm contains several key modifications to better suit our application space. First, we remove the feature extraction step with the CNN. This both speeds up reconstruction considerably, as a single volume is computed once at the beginning of reconstruction and repeatedly fed through the U-Net, and reduces the size of the 3D U-Net as we only need to feed in one volume instead of multiple feature volumes. We also found this approach to provide more accurate results for our samples likely because feature extraction removed rich photometric information from the images, which can assist in the 3D reconstruction of surfaces with few sharp features. Second, at this stage, we are not using MVS-Net in the traditional sense to perform training and inference; instead, we use the network as a deep image prior56 and start from a randomly initialized U-Net for each new sample. In this way, we can compute height maps for novel samples, without the need for a prior training dataset. In the future, as more data are generated with the FiLM-Scope, we can adapt our approach to allow for faster 3D reconstruction on a pre-trained network. Finally, the architecture of the 3D U-Net was modified to better match our needed axial resolution (see S3.1 in the Supplementary Material).

A diagram for our algorithm is shown in Fig. 3. First, we generate a digital z-stack Z of size H×W×N by performing back projection for all C cameras (typically 48) to N distinct planes over a defined spatial range and summing the results over all cameras, i.e.,

Eq. (25)

$Z_n = \sum_{i=0}^{C-1} B(I_i, h_n),$

where Zn is a single plane at index n within Z, and hn is a flat height map for the height at plane n. The range over which Z is computed is chosen manually, according to the height of the object (S3.1 in the Supplementary Material). If the height is not previously known, it can easily be determined by creating a preliminary virtual z-stack over a large range and identifying the planes where the top and bottom of the object are in focus. Ideally, the spacing between planes in the volume should roughly correspond to the axial resolution of the system. However, the algorithm can still produce fairly precise results if larger plane spacing is needed to conserve GPU memory.

We also compute a squared volume Z2 with planes

Eq. (26)

$Z_{n,2} = \sum_{i=0}^{C-1} B(I_i^2, h_n).$

Then, a variance volume is formed:

Eq. (27)

$Z_{var} = Z_2/C - (Z/C)^2.$

Zvar is then fed into a randomly initialized 3D U-Net, which outputs a probability volume P of size H×W×N. A voxel at location (px,py,n) in the volume contains the likelihood that the object projecting to point (px,py) in the reference image exists at the height of plane n, such that

Eq. (28)

$\sum_{n=0}^{N} P(p_x, p_y, n) = 1.$

From this volume, we can output a height map href by multiplying each plane in the volume by its corresponding height hn, and summing all the planes. If Pn is a plane in volume P

Eq. (29)

$h_{ref} = \sum_{n=0}^{N-1} P_n\cdot h_n.$
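
A compact sketch of Eqs. (25)–(29) is given below. Pre-warped images stand in for the back projections B(Ii, hn), and a softmax over random logits stands in for the 3D U-Net output; only the volume statistics and the soft-argmax height extraction are shown.

```python
import torch

C, N, H, W = 8, 32, 64, 64                     # cameras, planes, image size
heights = torch.linspace(-2.0, 2.0, N)         # plane heights h_n (mm)

# Stand-in for the back projections B(I_i, h_n), shape (C, N, H, W).
warped = torch.rand(C, N, H, W)

Z = warped.sum(dim=0)                          # Eq. (25)
Z2 = (warped ** 2).sum(dim=0)                  # Eq. (26)
Z_var = Z2 / C - (Z / C) ** 2                  # Eq. (27), variance volume

# Stand-in for the 3D U-Net output: logits normalized along the plane axis so
# the probabilities at each (x, y) sum to 1, as in Eq. (28).
P = torch.softmax(torch.randn(N, H, W), dim=0)

# Eq. (29): soft-argmax height map in the reference view.
h_ref = (P * heights.view(N, 1, 1)).sum(dim=0)  # shape (H, W)
```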

This height map is used to back-project all C−1 nonreference images to the reference view, creating a set of “rectified” images, I^i=B(Ii,href).

A structural similarity loss between the rectified images and the reference image, along with an image-aware depth smoothing loss (Lsm), is used to update the weights on the U-Net. Both of these losses are commonly used in 3D height estimation algorithms and are detailed in Ref. 57:

Eq. (30)

$L_{SSIM} = \sum_{i=0}^{47} \mathrm{mean}[1 - \mathrm{SSIM}(I_{ref}, \hat{I}_i)],$

Eq. (31)

$L_D = L_{sm}(h_{ref}, I_{ref}),$

Eq. (32)

$L_{total} = \lambda_1 L_{SSIM} + \lambda_2 L_D.$

Using this simple loss function allowed us to perform height reconstruction with minimal hyper-parameter tuning. λ2 and λ1 were held constant for samples shown in this work, and we expect this generalizability to hold over a large number of opaque sample types. In addition, when reconstructing height for videos with multiple frames, the network only needs to be trained from scratch on the first frame, and subsequent frames can use the previous frame’s network as a starting point, drastically speeding up reconstruction times.
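
The sketch below shows one way the combined loss of Eqs. (30)–(32) could be assembled in PyTorch. The uniform-window SSIM and the edge-aware smoothness term are common simplified formulations rather than the exact implementations of Ref. 57, and the weights lam1 and lam2 are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM map with a uniform window; x, y are (N, 1, H, W) in [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    sig_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sig_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sig_x + sig_y + c2)
    return num / den

def smoothness(h, img):
    """Edge-aware smoothness: penalize height gradients away from image edges."""
    dh_x = (h[..., :, 1:] - h[..., :, :-1]).abs()
    dh_y = (h[..., 1:, :] - h[..., :-1, :]).abs()
    di_x = (img[..., :, 1:] - img[..., :, :-1]).abs()
    di_y = (img[..., 1:, :] - img[..., :-1, :]).abs()
    return (dh_x * torch.exp(-di_x)).mean() + (dh_y * torch.exp(-di_y)).mean()

def total_loss(I_ref, rectified, h_ref, lam1=1.0, lam2=0.1):
    """Eqs. (30)-(32): SSIM loss over all rectified views plus depth smoothness."""
    l_ssim = sum(torch.mean(1 - ssim(I_ref, I_hat)) for I_hat in rectified)
    return lam1 * l_ssim + lam2 * smoothness(h_ref, I_ref)

# Toy usage with random grayscale tensors.
I_ref = torch.rand(1, 1, 64, 64)
rectified = [torch.rand(1, 1, 64, 64) for _ in range(4)]
h_ref = torch.rand(1, 1, 64, 64)
loss = total_loss(I_ref, rectified, h_ref)
```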

For this paper, the algorithm was run on Nvidia GPUs with up to 48 GB of RAM. This memory was adequate to perform reconstructions over the full FOV in a single run with 4×4 downsampled images, but full-resolution images currently need to be reconstructed in patches. The number of patches needed is sample dependent as samples with larger height variations require a larger number of planes N in the variance volume Zvar, which quickly consumes RAM.

To improve the patching process, we can begin by computing a low-resolution height map href,low and then up-sample it to match the size of the full-resolution map. We then use this upsampled map, which we will notate as hguide, to guide the high-resolution patching process. With this approach, rather than creating Z and Z2 using flat height maps, each plane is a back projection to a curved surface offset from hguide. For instance, to create a volume with N planes spaced d mm apart, the height map to warp to plane n would be given by

Eq. (33)

$h_n = h_{guide} + d\cdot(n - N/2).$

Taking this approach allows us to greatly reduce the number of patches needed for samples with large height variation, as we only need to create a small volume directly around the low-resolution height map, rather than searching over a large height range. More information is given in S3.3 in the Supplementary Material.
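
A small sketch of the guided plane construction in Eq. (33): a low-resolution height map is upsampled to the full image size, and the N reconstruction planes are stacked around it with spacing d; the sizes and spacing below are illustrative.

```python
import torch
import torch.nn.functional as F

def guided_planes(h_ref_low, full_size, N=16, d=0.05):
    """Eq. (33): N curved plane height maps spaced d mm apart around the
    upsampled low-resolution height map h_guide."""
    h_guide = F.interpolate(h_ref_low[None, None], size=full_size,
                            mode="bilinear", align_corners=False)[0, 0]
    return [h_guide + d * (n - N / 2) for n in range(N)]

# Toy usage: a random low-resolution map upsampled to a 512x512 grid.
planes = guided_planes(torch.rand(64, 64), full_size=(512, 512))
```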

As an additional benefit of our reconstruction algorithm, we can reduce noise in the reference view by summing all rectified images I^i, which are already generated during the course of reconstruction. Examples of these summed images compared with the raw images from a single view are shown in S3.4 in the Supplementary Material.

Disclosures

R.H. and M.H. are cofounders of Ramona Optics, Inc., which is commercializing multicamera array microscopes. M.H., P.R., G.H., and K.C.Z. are or were employed by Ramona Optics, Inc. during the course of this research. K.C.Z. was a consultant for Ramona Optics, Inc. The remaining authors declare no competing interests.

Acknowledgments

This work was supported by the National Cancer Institute (NCI) of the National Institutes of Health (Grant No. R44CA250877); the Office of Research Infrastructure Programs (ORIP), Office of the Director, National Institutes of Health, and the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (Grant No. R44OD024879); the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (Grant No. R43EB030979); the National Science Foundation (Grant Nos. 2036439 and 2238845); the Duke Coulter Translational Partnership Award; and the Fitzpatrick Institute at Duke University.

References

11. L. Richevaux et al., “In vivo intracerebral stereotaxic injections for optogenetic stimulation of long-range inputs in mouse brain slices,” JoVE (151), e59534, https://doi.org/10.3791/59534 (2019).

13. B. M. Geiger, M. Irene and E. N. Pothos, “Stereotaxic Surgery in Rodents for Stimulation of the Brain Reward System,” in Reward Systems, 21–50, Springer US, New York (2021).

20. R. M. Trout et al., “Methods for real-time feature-guided image fusion of intrasurgical volumetric optical coherence tomography with digital microscopy,” Biomed. Opt. Express 14(7), 3308–3326, https://doi.org/10.1364/BOE.488975 (2023).

28. Y. Wu et al., “A novel method for preoperative 6D pose estimation of rat skull based on 3D vision techniques,” in 27th Int. Conf. Mechatron. and Mach. Vision in Pract. (M2VIP), 223–228 (2021).

33. Y. Yao et al., “MVSNet: depth inference for unstructured multi-view stereo,” in Proc. Eur. Conf. Comput. Vision (ECCV), 767–783 (2018).

34. Y. Dai et al., “MVS2: deep unsupervised multi-view stereo with multi-view symmetry,” in Int. Conf. 3D Vision (3DV), 1–8 (2019).

42. J. Zhai, R. Shi and L. Kong, “Improving signal-to-background ratio by orders of magnitude in high-speed volumetric imaging in vivo by robust Fourier light field microscopy,” Photonics Res. 10(5), 1255–1263, https://doi.org/10.1364/PRJ.451895 (2022).

49. M. Martínez-Corral and B. Javidi, “Fundamentals of 3D imaging and displays: a tutorial on integral imaging, light-field, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566, https://doi.org/10.1364/AOP.10.000512 (2018).

58. neptune.ai, “neptune.ai: experiment tracker,” (2024).

59. T. Mertens, J. Kautz and F. Van Reeth, “Exposure fusion,” in 15th Pac. Conf. Comput. Graphics and Appl. (PG’07), 382–390 (2007).

Biography

Clare B. Cook is a PhD candidate at Duke University, in the Department of Biomedical Engineering. She is currently studying in the Computational Optics Lab, under Dr. Roarke Horstmeyer. She received her BSE degree in electrical engineering from Princeton University in 2020. Her current research focuses on high-speed 3D imaging systems for the mesoscale regime.

Kevin C. Zhou is an assistant professor of biomedical engineering at the University of Michigan. He received his BS degree from Yale University, MS degree and PhD from Duke University, and postdoctoral training from UC Berkeley. His lab’s research focuses on developing high-throughput 4D computational optical imaging systems. He is a recipient of the Barry Goldwater Scholarship, NSF Graduate Research Fellowship, and Schmidt Science Fellowship.

Martin Bohlen’s research focuses on understanding the structure–function relationship within the central nervous system. He investigates the anatomical organization of the brain in mice, rats, cats, and macaques. He has been developing efficient viral techniques for macaque and rat brains to elucidate visuomotor transformations structurally and functionally. His work involves stereotaxic surgeries on rodents and primates using surgical microscopes. By innovating surgical scopes—a tool largely unchanged in 50 years—his work advances both surgery and pathology.

Mark Harfouche, Chief Technology Officer at Ramona Optics, holds a PhD in electrical engineering from Caltech and a bachelor’s degree in engineering science from the University of Toronto. He brings deep expertise across electrical and mechanical engineering, programming, applied physics, device physics, and quantum mechanics. He leads innovation in advanced microscopy systems, integrating cutting-edge hardware and software to create optical imaging platforms that translate fundamental research into tools for biological discovery.

Kanghyun Kim received his BS degree in statistics with a minor in Computer Science from Chung-Ang University. He completed an MS degree in electrical and computer engineering and a PhD in biomedical engineering at Duke University. His research focuses on designing microscopic imaging methods based on densely packed arrays of micro-cameras to jointly image large areas at high resolutions, along with the development of automated analysis algorithms using machine learning.

Julia S. Foust is currently a fourth-year PhD candidate in biomedical engineering at Duke University, where she was previously advised by Joseph Izatt, PhD, and is currently co-advised by Roarke Horstmeyer, PhD, and Cynthia Toth, MD. Her research spans theoretical low-NA Gaussian beam focusing and the development of noninvasive, high-throughput optical imaging techniques for clinical applications. Her work focuses on advancing ophthalmic microsurgery and pediatric retinal imaging through next-generation microscopy and optical coherence tomography.

Ramana Balla holds an MS degree in biomedical engineering from Duke University with a certificate in biomedical data science, and a BTech in chemical engineering from the Indian Institute of Technology Hyderabad. His work at Duke focused on computational biology with applications in optics, protein engineering, and pharmacology, through specialized coursework and research projects.

Amey Chaware is a biomedical engineering PhD student at Duke University. He is a research assistant at Duke Computational Optics Lab and he specializes in optical systems and machine learning for digital pathology. His work focuses on building camera array imaging systems and subsequent analysis pipelines.

Roarke Horstmeyer is an assistant professor of biomedical engineering and electrical and computer engineering at Duke University. He is also the scientific director at Ramona Optics. He leads the Computational Optics Lab at Duke, which develops microscopes, cameras, and computer algorithms for a wide range of biomedical imaging and sensing applications. He earned a PhD from Caltech’s Electrical Engineering Department (2016), an MS degree from the MIT Media Lab (2011), and bachelor’s degrees in physics and Japanese from Duke in 2006.

Biographies of the other authors are not available.
