1. Introduction
Stereo vision has been used in surgical settings for over 100 years, at least since 1922, when Gunnar Holmgren first used a binocular operative microscope for an ear procedure.1 Since then, stereoscopes and stereo vision have been incorporated into both the practice and teaching workflows of microsurgeons in wide-ranging fields, including ophthalmology,2 neurosurgery,3 and otolaryngology,4 as well as in research settings, such as stereotaxic rodent surgery.5 Recent decades have seen the rise of digital stereoscopes alongside their analog counterparts. Digital stereoscopes, as well as related stereo-exoscopes, simultaneously capture two digital images of a surgical area from unique perspectives for a separate display to each eye of the operator.2 Heads-up stereoscopic displays provide ergonomic and teaching benefits and have enabled 3D vision in fields where analog stereoscopes cannot be easily deployed, such as minimally invasive surgery.6
Beyond the qualitative depth perception that analog and digital stereoscopes provide to human operators, quantitative 3D scene information is becoming increasingly critical in surgical settings. Automated and robotic surgical systems may rely on 3D cues for tool positioning5,7 and monitoring tissue surface deformation.8–10 Quantitative 3D feedback can also assist in the placement of handheld surgical tools. For instance, researchers performing stereotaxic rodent surgery frequently need to perform injections or place implants within specific regions of in vivo rodent brains, which are challenging and error-prone procedures.11,12 Precise and quantitative 3D feedback on the tool placement within the surgical area could improve the success rate of such procedures and increase their accessibility to labs without highly trained surgeons. However, accurately computing quantitative depth at the microscale is challenging with current digital stereoscopes, which provide just two unique views.
Numerous methods beyond stereoscopes have been developed in an attempt to provide depth information during microsurgery. First, in many procedures, sample alignment is performed prior to the start of surgery, and 3D position is tracked throughout the procedure based on the known anatomy or structure of the sample. For instance, in rodent stereotaxic surgery, the skull is mechanically aligned with a stereotaxic apparatus,13 and regions of interest can then be localized using a known atlas of the rodent brain. However, this process is time-consuming and requires a high level of expertise, leading to high failure rates.5,12 To improve stereotaxic procedures, haptic feedback from robotic systems can assist surgeons in navigating 3D space.14–16 Although these systems are precise, they can only be employed in specific confined settings and must follow pre-surgical alignment. To measure 3D tissue deformations in real time, optical coherence tomography (OCT) is utilized in microsurgery.4,17–19 OCT, either as a standalone modality or to supplement or guide digital stereoscopes, can operate with high precision and accuracy to distinguish between thin layers of tissue and visualize small vascular structures. Unfortunately, OCT typically has a limited field-of-view (FOV), forcing surgeons to frequently re-position their imaging system during surgery. The provided images may also be less intuitive for surgeons compared to conventional RGB images.20
Recently, novel computational microscopes have been developed to capture video from >2 unique perspectives to increase the accuracy of high-resolution quantitative 3D measurements. Lightfield21 and Fourier lightfield microscopes (FLFMs),22 for example, capture up to dozens of unique perspectives to robustly solve for specimen depth in the presence of various lighting conditions at each pixel. Unfortunately, such methods currently face an inherent trade-off between spatial and angular resolution, which generally results in 3D images with low pixel counts, far fewer than what is desired in microsurgical settings, where high resolution within a large FOV is the target criterion (e.g., for display on current 4K or upcoming 8K monitors).
In this work, we address these challenges with a novel camera array-based Fourier lightfield multiview stereo scope (FiLM-Scope), which uses the principles behind FLFM alongside a high-density array of compact cameras to simultaneously stream 48 high-resolution, large pixel-count viewpoints over an angular range of 21 deg×29 deg. A unique post-processing pipeline then combines these views into a 12.8 megapixel (28×37 mm FOV, 22 μm lateral resolution) 3D video, which allows for digital zooming and panning without the need for manually moving or adjusting the scope during surgery. We demonstrate the utility of the system for microsurgical settings by tracking the 3D movement of surgical tools over the surface of an ex vivo rat skull and showing video-rate deformation in stretching human skin.
2. Background
Before describing the details of our new approach, we first provide additional background on related existing technology.
2.1. Depth-From-Stereo
Depth-from-stereo extracts quantitative height information from two or more images of an object. This large body of work includes both conventional (typically feature extraction and matching) and deep learning solutions.23 There has been extensive research into optimal methods for performing depth-from-stereo in surgical settings, which has been successfully demonstrated in image-guided surgery.6,24,25 Regardless, surgical stereo vision has major drawbacks. Optical features are difficult to identify and match in sparse, low-contrast regions, which is a major challenge when imaging soft tissue. Occlusions, speckles, specular reflections, and other artifacts also drastically reduce the effectiveness of feature matching between stereo image pairs.9,24
As a result, although analog and digital stereoscopes are featured in many microsurgery workflows to provide qualitative depth perception to human operators,26,27 they are rarely used to provide quantitative 3D information, as they do not typically have the precision to resolve micron-scale surgical targets, including retinal membranes, fragile brain tissue, and small blood vessels.19 Depth-from-stereo has been employed in microsurgery to estimate the 3D pose of macroscopic rigid objects—such as digital alignment of rodent skulls prior to or during surgery5,28 and tool tracking during surgery.7,29,30 However, estimating the 3D structure of deformable tissue surfaces is considerably more challenging. As an example, several prior works have attempted 3D reconstruction of retinal surfaces from stereo images, which required prior assumptions of the shape and curvature of the retina that limit generalizability and prevent tracking of tissue deformations during surgery.29,30 In this work, we demonstrate how the FiLM-Scope can accurately measure the 3D height of complex, deformable tissue surfaces while jointly tracking surgical tools with down to 11 μm accuracy, thus offering a suitable measurement system for real-time 3D guidance within microsurgical settings.
2.2. Multiview Stereo
In macroscale 3D imaging, many of the limitations of traditional binocular stereo imaging have been overcome through the use of many cameras, otherwise known as multiview stereo (MVS).31,32 Compared with binocular stereo or depth-from-stereo using a small number of cameras, MVS can provide depth information with greater accuracy and precision. The use of many cameras allows for greater angular coverage, and the redundancy of the images can overcome challenges with noise, reflection, occlusions, and texture-sparse regions.31 With the rise of self-supervised machine learning approaches, 3D reconstruction from MVS images can be performed without extensive training data or hand-crafted features.33–35 As microscopic imaging with multiple cameras in surgical settings is practically challenging, limited prior work has applied MVS within microsurgery.
2.3. Lightfield Microscopy
As noted above, FLFM can form multiperspective images from a single image sensor. FLFM, as well as similar systems published under the names extended-LFM (XLFM)36 and Fourier integral microscopy (FI-MIC),37,38 uses a cascade of a microscope objective lens, relay lenses, and a microlens array to form multiple images of a sample from different perspectives onto one digital image sensor. FLFM has been used to perform high-speed 3D imaging of microscopic samples, including human organoid dynamics,39 live cell processes,40,41 and neural activity in small organisms.36,42 However, scaling FLFM for surgical imaging over larger FOVs presents several key challenges. Most FLFM systems form many images of an object on a single sensor, so the number of available pixels must be divided between the number of images formed. This leads to unacceptably low resolution when acquiring a large number of angular perspectives on conventional sensors, while adopting new sensors with several hundred or more megapixels (MP) is not an option when video-rate imaging is required. The camera array-based light-field microscopy (CALM) system successfully demonstrated an FLFM system using multiple imaging sensors.43 However, the CALM system was not designed for quantitative video-rate 3D profiling of dynamic surfaces as is demonstrated here.
3. FiLM-Scope Hardware and Algorithm Design
3.1. Optical Design
The FiLM-Scope optical design consists of two main components: the “primary lens” (analogous to the “Fourier” lens of conventional FLFM setups22) and a compact multicamera array microscope [MCAM, Fig. 1(a)], placed together in a 4f configuration [Fig. 2(a)]. The MCAM we use44–47 consists of 48 microcameras, tightly packed into a 6×8 grid with 9 mm inter-camera spacing. Each microcamera includes its own custom multi-element lens and 12.8 MP CMOS sensor, all of which are triggered synchronously from a shared FPGA (S7 in the Supplementary Material). In the FiLM-Scope configuration, the 48 cameras concurrently acquire high-resolution images of our sample from 48 unique perspectives, giving a total of 600 MP per snapshot [Fig. 1(b)]. The MCAM can stream at 5 gigabytes (GB) per second, enabling video at ∼8 frames per second (FPS) at full resolution, or up to 120 FPS with 4×4 pixel binning.
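For intuition, the quoted frame rates are consistent with this streaming bandwidth under a simple back-of-the-envelope calculation (assuming roughly one byte per streamed pixel, an illustrative assumption):

```python
# Back-of-the-envelope check of the MCAM streaming figures quoted above,
# assuming ~1 byte per streamed pixel (an illustrative assumption).
num_cameras = 48
pixels_per_camera = 12.8e6      # 12.8 MP sensors
bandwidth_bytes = 5e9           # ~5 GB/s aggregate streaming rate

full_res_frame = num_cameras * pixels_per_camera       # ~614 MP per snapshot
binned_frame = full_res_frame / (4 * 4)                # 4x4 pixel binning

print(f"full resolution: ~{bandwidth_bytes / full_res_frame:.0f} FPS")   # ~8 FPS
print(f"4x4 binned:      ~{bandwidth_bytes / binned_frame:.0f} FPS")     # ~130 FPS (quoted: up to 120)
```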
Fig. 1
FiLM-Scope configuration and example images. (a) Top: picture of multicamera array microscope (MCAM) lenses. 48 lenses are held in a single block, with 9 mm pitch. Bottom: 3D model of FiLM-Scope. (b) Snapshots from the FiLM-Scope of an ex vivo rat skull. 48 images are acquired in a single frame, allowing for video acquisition at up to 120 frames per second. The green and purple insets highlight the angular range of the system, whereas the red inset highlights the high resolution. (c) 3D height map from the FiLM-Scope. Top left: single image of the rat skull from the FiLM-Scope. Top right: height map for the rat skull. Bottom: skull displayed as a 3D point cloud, generated using the height map. (d) Four frames from a video of a hand-held tool moving over the surface of a rat skull, acquired at 100 frames per second. 3D reconstruction was performed on all four frames, and the resulting point clouds are overlaid in the bottom frame, highlighting the motion of the tool in 3D space.
Fig. 2
FiLM-Scope calibration results. (a) Diagram of ideal parametric FiLM-Scope system, demonstrating how 3D information is encoded in the images. Red points lie on object plane and appear at the same pixel location in both highlighted elemental images. Blue points are axially translated off the object plane. On the center elemental image, the blue points project to the same image location as the red points (forming purple points). In the upper elemental image, the projections of the blue points are laterally translated relative to the projections of the red points. (b) Example FiLM-Scope calibration results, showing shift slopes (Si) for all 48 cameras (all units: pixels/mm). Top: average shift slope value for each camera. Bottom (red and blue insets): Varying Si(p) values for two cameras across the image fields-of-view. (c) Top: on-axis, diffraction-limited axial resolution for each camera with a logarithmic scale bar. Bottom: on-axis, diffraction limited lateral resolution for each camera. (d) Digital refocusing results with raw reference camera image (top) and digital refocus back-projections of all 48 images to the respective axial planes (0, 6.4, 8.0, and 10.7 mm) using data from a snapshot in Fig. 1(b). A full digitally refocused volume is shown in Video 1 (Video 1, Mp4, 19.4 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s1]).
To utilize all 48 cameras, the primary lens must have an aperture larger than the diameter of the MCAM (90 mm). In addition, the f-number of the lens (the ratio between its focal length and diameter) should be as small as possible to improve lateral and axial resolution. When designing the FiLM-Scope, we used the definition for axial resolution detailed by Guo et al.22 and Galdón et al.,48 which provides a resolution metric for the optical system, independent of the reconstruction algorithm. Depending on the sample type and reconstruction algorithm, the true axial precision can fall above or below this value. For a given microcamera i, the axial resolution (ϕax,i) is the smallest axial separation at which two laterally aligned points are separately resolved in camera i. In our setup, this is given as

ϕax,i = ϕlat,i/tan(θi), (1)

where ϕlat,i is the lateral resolution in camera i and θi is the angle between the optical axis and the chief ray for camera i. The axial resolution of the system is then reported as the finest resolution among all the cameras [Fig. 2(c)]:

ϕax = ϕlat/(δmax/F), (2)

where δmax is the distance between the optical axis and the array lens furthest from the axis and F is the focal length of the primary lens (see Table 2 and S1 in the Supplementary Material). In reality, the FiLM-Scope can provide far more precise 3D results than this value would suggest, depending on the sample and the algorithm used in 3D reconstruction. Regardless, this provides a useful benchmark for assessing the system’s optical design.
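As a concrete example of Eq. (2), inserting the ideal diffraction-limited lateral resolution and geometry from Table 2 gives ϕax,diff = ϕlat,diff/(δmax/F) = 9.30 μm/(38.7 mm/100 mm) ≈ 24 μm, the aberration-free value listed in Table 2.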
To ensure our system has an adequate axial resolution, we chose the Zeiss R-Biotar 100 mm f/0.73 lens as our primary lens, which was originally designed to image an X-ray system’s scintillator response onto large-format film. Because this lens is not fully infinity corrected, the FiLM-Scope suffers from spherical and coma aberrations, resulting in a significant resolution fall-off toward the edges of all the images, as well as throughout the FOV for cameras on the edge of the array [Fig. 2(c), and S1 in the Supplementary Material]. In addition, each of the 48 images has a unique distortion pattern, which is compensated for in our calibration procedure. In future iterations of this system, a custom-designed primary lens could further improve overall system performance. The specifications for all lenses are given in Table 1.
Because 3D information encoding in FLFM relies on the lens configuration rather than illumination patterns, the FiLM-Scope can be employed with a range of illumination setups. Images shown in this work were acquired with one of two illumination setups: exterior epi-illumination with a flexible LED strip affixed to the front of the lens, or internal epi-illumination from directly behind the primary lens via a large ring illuminator, both of which are detailed in S8 in the Supplementary Material.
Table 1
System specifications. The array lenses are custom-designed lenses from Edmund Optics, designed to fit in the tight 9 mm pitch of the MCAM sensor board. The lenses and sensors are packed into a 6×8 grid, which determines DMCAM and δmax. The primary lens is the Zeiss R-Biotar f/0.73 lens, originally designed for use in an X-ray system.
Parameter | Symbol | Value
Primary lens focal length | F | 100 mm
Primary lens diameter | D | 100 mm/0.73 = 137 mm
Array lens focal length | f | 14.64 mm
Array lens diameter | d | 5.7 mm
Pixel pitch | ρ | 1.1 μm
MCAM lens spacing | p | 9 mm
Distance from optical axis to furthest lens | δmax | √[((6−1)·p)² + ((8−1)·p)²]/2 = 38.7 mm
MCAM diameter | DMCAM | √[(6·p)² + (8·p)²] = 90 mm
3.2. Calibration
In Fourier lightfield imaging, each array lens receives rays emitted from the object from a unique range of angles. As a result, when an object shifts axially away from the focal plane (Δz), its projection in each image shifts laterally [Fig. 2(a)]. In an ideal, paraxial system, the magnitude of that shift in pixels will be given by

Δpi = M·Δz·tan(θi)/ρ,

where M is the system magnification, θi is the angle between the chief ray entering lens i and the optical axis, and ρ is the pixel pitch. Because points at the object plane should appear at identical pixel locations in every elemental image (an image formed by a single camera in the array), we can calculate the distance of a point from the object plane by finding the disparity between its projection in any two elemental images. Specifically, if a point appears at pixel pi in image i and pj in image j, we can find its axial location z with the following equation:

z = ρ·(pi − pj)/[M·(tan(θi) − tan(θj))].
This concept forms the basis of projection between image and object space in FLFM, which is utilized in nearly all 3D reconstruction approaches.
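For illustration, the ideal paraxial relationship above can be written as a minimal function; the angles, magnification, and pixel pitch follow the definitions above, and the numerical values in the example call are placeholders rather than calibrated system parameters (the calibrated model described next replaces these constants with per-pixel polynomials):

```python
import numpy as np

def depth_from_disparity(p_i, p_j, theta_i, theta_j, M, rho):
    """Ideal paraxial depth estimate from one matched point pair.

    p_i, p_j   : pixel coordinates (1D, along the shift direction) of the same
                 point in elemental images i and j
    theta_i/j  : chief-ray angles of cameras i and j (radians)
    M          : system magnification
    rho        : pixel pitch (mm/pixel)
    Returns z in mm (axial offset from the object plane).
    """
    disparity = p_i - p_j                              # pixels
    return rho * disparity / (M * (np.tan(theta_i) - np.tan(theta_j)))

# Example call with placeholder values (not calibrated system parameters):
z = depth_from_disparity(p_i=2510.0, p_j=2470.0,
                         theta_i=np.deg2rad(12.0), theta_j=np.deg2rad(-12.0),
                         M=0.12, rho=1.1e-3)
print(f"estimated height: {z:.3f} mm")
```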
Due to the distortions that arise from using a large primary lens and small misalignment errors, our system differs from the ideal system in several key ways. First, both M and θ may vary within a single elemental image. Second, we cannot assume that samples at the object plane will project identically to each elemental image. Thus, we need a calibration model that can capture these deviations. Due to the black box nature of our primary lens, we chose to use an experimental calibration scheme.
In our calibration model, each camera i is described by two pairs of equations. The first pair, Si,x(p) and Si,y(p), describes the ratio between the lateral shift of a point’s projection in image space (Δpi, given in pixels) and the axial shift of the point in object space (Δz, given in millimeters), which we will refer to as the “shift ratio.” The second pair, Oi,x(p) and Oi,y(p), describes the shift in pixels between where a point on the object plane appears in image i and where it appears in a chosen reference perspective, which we will call the “inter-camera shifts.” All four equations are parameterized by the coefficients for a second-order, 2D polynomial equation, where the input is a 2D pixel location, p=(px,py). Thus, our full calibration model consists of 48×2×2 polynomial equations, which together can fully describe the forward projection from object space into the image space.
Si,x(p) and Si,y(p) are calibrated separately for shifts in the x and y dimensions, without explicitly estimating M or θ [Fig. 2(b)]. Specifically, for a given pixel location pi in camera i,

Δpi,x = Si,x(pi)·Δz and Δpi,y = Si,y(pi)·Δz.

We experimentally fit the coefficients for Si(p) for each camera by imaging a graph target at five or more distinct axial planes, measuring how far each graph vertex shifts between images, and fitting a line to those shifts. Next, we fit the coefficients of Si(p) to give the slopes of those lines across the image FOV. We similarly fit Oi(p) by imaging the graph target at the object plane, then measuring the shift in pixels between the locations of the graph vertices in image i and the locations of the same vertices in the chosen reference image. More information is given in Appendix B.
After calibration, the parametrized model of our system can be applied to a wide range of 3D reconstruction algorithms. As a first example, we can generate a digital z-stack by back-projecting all of the images onto a chosen z plane, which brings into focus all points lying at the selected plane and blurs all other regions, akin to the “shift-and-sum” or “digital refocusing” operation in traditional lightfield imaging49,50 [Fig. 2(d), Video 1]. This z-stack can be used to qualitatively visualize 3D structures in the imaged object.
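A minimal sketch of this shift-and-sum refocusing under the ideal model is given below; in practice, the calibrated per-pixel shift maps would replace the constant shift ratios assumed here, and the sign convention must match the calibration:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def digital_refocus(images, shift_ratios, z_plane):
    """Shift-and-sum refocusing of FLFM elemental images onto one z plane.

    images       : list of (H, W) arrays, one per camera, already registered
                   so that points on the object plane coincide across views
    shift_ratios : list of (sx, sy) shift ratios in pixels/mm for each camera
    z_plane      : axial plane to refocus to, in mm from the object plane
    """
    refocused = np.zeros_like(images[0], dtype=np.float64)
    for img, (sx, sy) in zip(images, shift_ratios):
        # Undo the parallax shift each camera sees for points at z_plane
        refocused += nd_shift(img, shift=(-sy * z_plane, -sx * z_plane), order=1)
    return refocused / len(images)

# A digital z-stack is then a list of refocused planes, e.g.:
# z_stack = [digital_refocus(images, shift_ratios, z) for z in np.linspace(0, 10, 50)]
```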
3.3. 3D Reconstruction Algorithm
To obtain dense, quantitative height information, we adopted an approach similar to that demonstrated in prior algorithms such as MVS-Net, which was designed for MVS 3D reconstruction of macroscale objects.33–35,51 Our unsupervised optimization algorithm is outlined in Fig. 3. First, all 48 multiperspective images are back-projected through the expected depth range of the sample to create a digital z-stack. The z-stack is oriented to the perspective of a single reference camera, which we choose to be one of the four center cameras in the array. We will refer to this perspective as the “reference view.” The z-stack is then fed through a 3D U-Net, which outputs a probability volume of the same size, describing the likelihood of the sample’s surface passing through each voxel in the volume. For each x,y location in the volume, the values along the z axis sum to 1, so we can extract a depth map from the volume by multiplying each plane in the volume by its corresponding height and summing all the planes (see Appendix D for more details).
Fig. 3
Diagram for self-supervised 3D reconstruction algorithm developed for the FiLM-Scope. Images are first back-projected into a 3D volume, using our custom calibration approach. The volume is then fed through a 3D U-Net, outputting a probability volume from which we can extract a height map. All images are rectified to a reference viewpoint using this height map, and the loss between each of the rectified images and the “reference image” (acquired from the reference camera) is used to update weights on the U-Net.
A loss is computed by using the depth map to rectify all 48 images to the reference view and then taking the structural similarity between the rectified images and the original image acquired from the reference camera. This loss is used to update weights on the U-Net. The same z-stack is then iteratively fed through the U-Net while updating the weights until the algorithm converges. The final depth map can then be Gaussian or median filtered to further reduce noise.
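The sketch below illustrates the structure of this optimization loop under several simplifying assumptions: a small stand-in network replaces the actual 3D U-Net, a plain L1 photometric term and a generic smoothness term replace the structural similarity and image-aware losses used in this work, and the warp_to_ref callable (which must be differentiable with respect to the height map, e.g., built on grid_sample as in Appendix C) stands in for the calibrated rectification step:

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Small stand-in for the 3D U-Net: maps the back-projected volume to
    per-plane scores; softmax along z yields the probability volume."""
    def __init__(self, ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 1, 3, padding=1),
        )

    def forward(self, vol):                            # vol: (1, 1, N, H, W)
        return torch.softmax(self.net(vol), dim=2)     # sums to 1 along z

def reconstruct_height(z_stack, plane_heights, images, ref_image, warp_to_ref,
                       iters=200, lr=1e-3, smooth_weight=0.1):
    """Simplified self-supervised height estimation in the spirit of Fig. 3.

    z_stack       : (N, H, W) back-projected volume fed to the network
    plane_heights : (N,) tensor of plane heights in mm
    images        : (C, H, W) elemental images
    ref_image     : (H, W) reference-camera image
    warp_to_ref   : callable(images, height_map) -> (C, H, W) rectified images;
                    must be differentiable with respect to height_map
    """
    net = TinyUNet3D()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    vol = z_stack[None, None]                          # (1, 1, N, H, W)
    for _ in range(iters):
        prob = net(vol)[0, 0]                          # (N, H, W)
        height = (prob * plane_heights[:, None, None]).sum(0)   # expected height, (H, W)
        warped = warp_to_ref(images, height)
        photo_loss = (warped - ref_image[None]).abs().mean()    # simplified photometric loss
        smooth_loss = ((height[:, 1:] - height[:, :-1]).abs().mean()
                       + (height[1:, :] - height[:-1, :]).abs().mean())
        loss = photo_loss + smooth_weight * smooth_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return height.detach()
```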
4. Results
4.1. Optical Specifications
The FiLM-Scope achieves a FOV of up to 28.2×37.1 mm, with on-axis lateral resolution and depth-of-field (DOF) for the central camera of 21.6 μm and 3.1 mm, respectively, measured by taking the line-spread-function across a USAF resolution target (S1.2 in the Supplementary Material). The system does experience aberrations toward the edges of the FOV, which are described in S1.5 in the Supplementary Material. Using the definition of axial resolution provided in Guo et al.,22 our optical design has an on-axis diffraction-limited axial resolution of 82.1 μm [Fig. 2(c)]. Although Fourier lightfield imaging does provide generally consistent 3D precision within the imaging volume,22 the optics-dependent axial resolution is proportional to the lateral resolution [Eq. (1)], so the axial resolution will vary with the system aberrations across the FOV and within the DOF (S6 in the Supplementary Material). Because the true system precision is also dependent on the sample type and reconstruction algorithm, we can produce 3D results with greater precision than this theoretical value. Our full reconstruction pipeline has yielded results with down to 11 μm standard error (Figs. 4 and 5). The governing equations and theoretical and measured specifications are summarized in Table 2, and the process for calculating these values is detailed in Appendix A and S1 in the Supplementary Material.
Fig. 4
Select results. (a) Images and height maps for an ex vivo rat skull. For each of the four crops, the left panel shows the grayscale image, and the right panel shows an overlay of the reconstructed height map and the grayscale image. The FiLM-Scope algorithm detects small height changes on the skull surface, down to fractions of a millimeter. (b)–(c) Grayscale images and height maps of human fingers. Insets highlight details of the reconstructed height map. In the bottom inset for each image, the global tilt in the height map is removed to clearly highlight 3D spatial resolution.
Fig. 5
FiLM-Scope video acquisition. (a) 3D tracking of points on a tool, as a human user moves the tool over the surface of an ex vivo rat skull. Top: select frames with marked points. Bottom: graph of estimated heights for each point during 8 s of acquired video, captured at 100 FPS, along with the angle of the tool, measured between two points on its surface. The full video is provided in Video 2 (Video 2, Mp4, 16.4 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s2]). (b) Estimated height of tool translated axially on an automated stage away from an ex vivo rat skull. Our algorithm accurately estimated the height of the tool with 11 μm standard error. (c) Select frames from a video and corresponding height maps of human skin. Video was acquired with 4×4 pixel binning at 100 FPS. In the right-most insets, global tilt was removed from the height maps to highlight fine features. The full video is shown in Video 3 (Video 3, Mp4, 38.2 MB [URL: https://doi.org/10.1117/1.APN.4.4.046008.s3]).
Table 2
Major system performance metrics are summarized here. All metrics except angular range can be reported separately for each camera. Here, values for magnification, lateral resolution, and depth of field are reported for the center camera, along with the mean and standard deviation across all cameras. Axial resolution is reported for the camera with the smallest value, as that determines the axial resolution of the system as a whole. A wavelength value of λ=530 nm was used in the relevant calculations. More details are given in S1 in the Supplementary Material.
Parameter | Symbol | Equation | Theoretical | Measured | Mean | Std. dev.
Magnification | M | f/F | 0.146 | 0.1215 | 0.1212 | 0.0004
Numerical aperture | NA | tan−1[d/(2F)] | 0.028 | — | — | —
Diff. lateral resolution | ϕlat,diff | λ/(2·NA) | 9.30 μm | 21.63 μm | 23.43 μm | 2.57 μm
Geo. lateral resolution | ϕlat,geo | 2ρ/M | 15.03 μm | 18.10 μm | 18.15 μm | 0.05 μm
Diff. axial resolution | ϕax,diff | ϕlat,diff/(δmax/F) | 24.0 μm | 82.1 μm | — | —
Geo. axial resolution | ϕax,geo | ϕlat,geo/(δmax/F) | 38.8 μm | 53.5 μm | — | —
Depth of field | DOF | λ·√(n²−NA²)/NA² | 652 μm | 3.08 mm | 3.13 mm | 0.05 mm
Angular range | — | 2·tan−1(δ/F) | (25.4 deg, 34.0 deg) | (21.4 deg, 29.4 deg) | — | —
4.2. Height Maps
Applying our 3D reconstruction algorithm (Fig. 3) to snapshots acquired with the FiLM-Scope, we can distinguish the heights of fine features within the DOF of the system while also reconstructing large structures well outside the DOF, up to 1 cm. Example height maps are shown in Fig. 4. Figure 4(a) shows small height changes on the surface of a rat skull. In these maps, global tilt was removed to better show the precision of the results. The height map for a full rat skull is shown in Fig. 1.
Figures 4(b)–4(c) show height maps for human fingers. From these results, we can see both the curvature of the finger over a depth of about 1 cm and the heights of small cracks and height changes on the skin surface. Additional reconstruction results with ground truth samples are included in S6 and S9 in the Supplementary Material, and a comparison against OCT data is shown in S5 in the Supplementary Material.
4.3. 3D Video Acquisition and Reconstruction
The FiLM-Scope acquires all information needed for 3D reconstruction in a single snapshot, so we can acquire 3D video at the full frame rate of the MCAM. Currently, the streaming rate of the MCAM is ∼5 GB/sec, which can be flexibly allocated between the number of cameras used, the streaming pixels per camera, and the system frame rate. For example, we can stream from all 48 cameras at full resolution at ∼8 FPS, or from 48 cameras with 4×4 pixel down-sampling at ∼120 FPS. The frame rate can be further increased by streaming from only a subset of the 48 cameras. After acquisition, we perform 3D reconstruction on the first video frame using the previously described algorithm (Sec. 3.3). We then sequentially reconstruct each subsequent frame by fine-tuning the network from the previous frame. This approach drastically speeds up reconstruction time compared with starting from a randomly initialized U-Net for each frame.
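Conceptually, this warm-start strategy amounts to the following loop, where reconstruct_frame is a hypothetical wrapper around the optimization of Sec. 3.3 that accepts an initial network and returns the trained one (iteration counts are arbitrary placeholders):

```python
# Sequential video reconstruction: each frame starts from the previous frame's
# network weights instead of a random initialization (names are illustrative).
def reconstruct_video(frames, reconstruct_frame):
    heights, net = [], None
    for images in frames:                  # images: all camera views for one time point
        # Full optimization for the first frame; brief fine-tuning afterward
        iters = 2000 if net is None else 100
        height_map, net = reconstruct_frame(images, init_net=net, iters=iters)
        heights.append(height_map)
    return heights
```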
To demonstrate the utility of our approach for 3D visualization of tissue surfaces and tool movement during microsurgery, we used the FiLM-Scope to capture 3D video of tools moving over the surface of an ex vivo rat skull. We first acquired a video at 100 FPS with 4×4 pixel down-sampling while a human user freely moved the tool in 3D space and tapped the surface of the skull. After performing 3D reconstruction, we manually selected five points along the tool in the first frame from the reference camera and used optical flow to track these points throughout the video. We then extracted the heights of these points from the corresponding locations in the height map. Plotting these results [Fig. 5(a)], we can see the variation in the height and angle of the tool as the user lowers the tool to touch the skull. The 3D video from this experiment is provided in Video 2.
Next, we mounted the tool on an electronic stage to move it to known locations along the optical axis of the system. We acquired a full-resolution snapshot at each stage height and performed 3D reconstruction on all frames. We then selected a single point on the tool and extracted the height at that point in all frames, averaging over a 100×100 pixel region (corresponding to a 0.9×0.9 mm region of the tool). The plot of our estimated heights is shown in Fig. 5(b). We correctly estimated the axial movement of the tool with a mean absolute error of 11.13 μm. When coupled with our system’s 11 μm half-pitch lateral spatial resolution, this indicates a microscope with nearly isotropic 3D measurements at resolutions requisite within microsurgical scenarios.
Our approach can additionally be used to monitor deformation in tissue. In Fig. 5(c), we show select frames from a video of human skin prodded with a wooden tool to cause stretching and bunching, acquired at 100 FPS with 4×4 pixel downsampling. We were able to visualize fine features in the skin and skin-tool interactions and detect small height changes between frames. The 3D video for this experiment is shown in Video 3.
5. Discussion and Conclusion
In this work, we introduce the FiLM-Scope, a novel system for acquiring 48 simultaneous 12.8 MP multiperspective images of a surgical surface in a single snapshot with a single optical system. The FiLM-Scope captures high-resolution images (down to 22 μm) over a large FOV (28.2×37.1 mm), allowing for digital zooming and panning to eventually eliminate the need for surgeons to manually adjust the placement of their surgical scopes. We also propose a self-supervised approach for generating dense height maps from these multiperspective images and show the utility of our approach by monitoring the movement of surgical tools over an ex vivo rat skull and visualizing 3D deformation of human skin.
Many potential samples of interest do not require all 48 images for successful 3D reconstruction. The number of cameras needed is both sample dependent (complex or dense objects may require more cameras) and task dependent (some tasks require finer resolutions or higher frame rates). Thus, in future versions of the FiLM-Scope, a smaller number of cameras could be used to decrease the system size and weight. Alternatively, sets of cameras can be used for different functions. We could focus subsets of cameras to distinct focal planes to improve the overall DOF of the system or use multimodal filters over some cameras for simultaneous, multimodal 3D image acquisition. Initial tests with varied numbers of cameras used in 3D reconstruction are shown in S9 in the Supplementary Material. Based on these results, we expect that a future FiLM-Scope system using only 4×4 cameras can provide a compact, light-weight system that still delivers high-quality depth information.
In addition to the size and weight, a current limitation of the FiLM-Scope is the time required for 3D reconstruction. Although images can be acquired at up to 120 FPS, reconstruction can take several minutes. This is, of course, not compatible with providing live visualization to surgeons. Fortunately, significant work is being performed to increase the speed and efficiency of MVS algorithms,52–55 and these changes can be incorporated into our algorithm in the future. In addition, with the collection of a larger dataset, we can improve training and inference to eventually enable live reconstruction.
In conclusion, the FiLM-Scope demonstrates how increasing the number of cameras used in stereoscopes can yield detailed and complete 3D results, with great potential for improving surgical setups, among other applications. We expect future iterations of this system to improve 3D visualizations in a wide range of settings requiring high-resolution 3D imaging over large fields of view.
6. Appendix A: Optical Design
As described above, the FiLM-Scope has two main optical components: the primary lens and a multicamera array in a 4-f configuration. As detailed in Galdón et al.,48 the key parameters in determining theoretical performance are the focal lengths of the primary and array lenses, the aperture size of the array lenses, and the full diameter of the lens array. To maximize axial resolution, the aperture of the primary lens must be at least as large as the diameter of the lens array, but increasing its aperture size beyond that will not impact performance (Table 2).
Our system uses the MCAM, which was originally designed to capture high-resolution 2D images and video over a large FOV44,45 and has since been used in several architectures to achieve 3D visualization.46,47 The MCAM hardware has been highly optimized to ensure the microcameras are compact, synchronized (S7 in the Supplementary Material), and can stream at high frame rates. As such, this is an ideal choice for the FiLM-Scope setup.
For the primary lens, we used the Zeiss R-Biotar 100 mm f/0.73 lens, which was originally designed to image an X-ray system’s scintillator response onto large-format film. This lens has a large aperture size, 137 mm, which can fully cover the MCAM (max diameter 90 mm), along with a small f-number to improve magnification, resolution, and angular disparity of the acquired images (see Table 2 and S1 in the Supplementary Material). In general, Fourier lightfield imaging requires an infinity-corrected primary lens, whereas our chosen lens was designed for finite-conjugate imaging. However, because it was designed to image a fluorescent response from a distance significantly larger than its focal length onto film immediately adjacent to its back glass surface, we can use the lens backward (i.e., placing our sample where the film previously sat) and approximate the lens as infinity corrected.
7. Appendix B: System Calibration
The calibration of a 3D imaging system is critical in determining both the accuracy and precision of results, and the complexity or simplicity of the calibration scheme will impact reconstruction time for many algorithms. Thus, we need a model that captures the details of our system while still allowing for rapid forward and backward projection during reconstruction. Because our system includes a large, black-box primary lens that can introduce significant distortions, we chose to use an experimental, ray-based model that does not assume spatial invariance within each microcamera.
As detailed in Sec. 3.2, each of our 48 cameras is described by two pairs of second-order, 2D polynomial equations (giving a total of 48×2×2 equations for the full system). The first pair of equations, Si,x(p) and Si,y(p), describes the “shift ratio” for camera i (the ratio between the shift in pixels experienced by the projection of an object in camera i, to the axial shift in millimeters experienced by the object, or Δp/Δz) in the x and y dimensions. In an ideal system, the shift ratio for camera i would be given by

Si = M·tan(θi)/ρ = f·di/(F²·ρ),

where f is the focal length of the array lenses, F is the focal length of the primary lens, θi is the angle of the central ray entering array lens i relative to the optical axis [Fig. 2(a)], and di is the distance from the optical axis to array lens i. To account for nontelecentricity, distortions, and misalignment, which may cause both M and θi to vary within an elemental image, we calibrate for Si without explicitly finding M or θi and define an expression to give its value at each pixel location within an elemental image.
The second pair of equations, Oi,x(p) and Oi,y(p), gives the “inter-camera shifts” (the shift in pixels between the pixel location p where a point on the object plane appears in camera i, to where it appears in the reference camera) in the x and y dimensions. In an ideal, paraxial system, the inter-camera shifts would equal 0 for all cameras; however, we calibrate for this value to allow for nonidealities or small misalignments in the system.
The end goal of our calibration is to find nine coefficients for each polynomial equation. Specifically, we want to find all a and b values for the following four equations for each camera i, where each of Si,x(p), Si,y(p), Oi,x(p), and Oi,y(p) takes the full second-order (bi-quadratic) form

c1 + c2·px + c3·py + c4·px·py + c5·px² + c6·py² + c7·px²·py + c8·px·py² + c9·px²·py²,

with the nine coefficients denoted a1–a9 for the shift-ratio polynomials Si,x and Si,y and b1–b9 for the inter-camera-shift polynomials Oi,x and Oi,y.
To perform the calibration, we capture a set of images of a graph target at five or more axial planes, separated by 1 mm. We then extract the locations of all the graph vertices in each of the 48 images, at each plane. For a single camera, we can match the pixel locations of each vertex between all five images, and fit a line to the locations (with a slope in pixels/mm). Then, by taking these slopes across the full image, we can use a least squares regression to find coefficients for the polynomial equation, which will take a pixel location p=(px,py) as input, and output the shift ratio Si(p) at that point. This is repeated for all 48 cameras.
Next, we select the object plane and a reference camera. For each nonreference camera, we match all the vertex locations from the target image captured at the object plane, to the corresponding vertex locations in the reference camera’s image. We can then fit polynomial equations Oi,x(p) and Oi,y(p) for each camera i to give the shift in pixels between the target image and reference image at a given pixel location. Taken together, these sets of polynomial equations provide a complete mapping from object to image space, and vice versa. A flowchart for the full calibration procedure is given in S2 in the Supplementary Material.
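The per-camera fitting step can be sketched as follows, assuming vertex detection and matching have already been performed; the 9-term bi-quadratic basis corresponds to the nine coefficients described above, and the array layouts are illustrative:

```python
import numpy as np

def biquadratic_basis(px, py):
    """9-term second-order 2D polynomial basis evaluated at pixel coords (px, py)."""
    terms = [px**m * py**n for m in range(3) for n in range(3)]
    return np.stack(terms, axis=-1)                     # (..., 9)

def fit_shift_ratio_poly(vertex_px, vertex_py, vertex_xy_by_plane, plane_z_mm):
    """Fit S_i,x(p) and S_i,y(p) for one camera.

    vertex_px, vertex_py : (V,) pixel coords of the V matched vertices (object plane)
    vertex_xy_by_plane   : (P, V, 2) pixel coords of those vertices at each of P planes
    plane_z_mm           : (P,) axial positions of the calibration planes in mm
    """
    # 1) Per-vertex line fit: slope of pixel position vs. z, in pixels/mm
    slopes = np.polyfit(plane_z_mm,
                        vertex_xy_by_plane.reshape(len(plane_z_mm), -1), 1)[0]
    slopes = slopes.reshape(-1, 2)                      # (V, 2): (slope_x, slope_y)

    # 2) Least-squares fit of the bi-quadratic surface to those slopes
    A = biquadratic_basis(vertex_px, vertex_py)         # (V, 9)
    coeffs_x, *_ = np.linalg.lstsq(A, slopes[:, 0], rcond=None)
    coeffs_y, *_ = np.linalg.lstsq(A, slopes[:, 1], rcond=None)
    return coeffs_x, coeffs_y        # evaluate as biquadratic_basis(px, py) @ coeffs
```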
8. Appendix C: Height Calculation and Projection
The axial location of a point in space can be calculated from this calibration model using its pixel location in any two cameras. If a point appears at pixel location pk=(pk,x,pk,y) in camera k and pn=(pn,x,pn,y) in camera n, then the height z can be given by either its x or y pixel locations
or
From there, if the magnification in camera k is Mk and the pixel pitch is ρ, we can calculate the lateral location of the point with
or equivalently using camera n in place of k.
We will now give a brief explanation of how projection can be performed using this calibration model, which is essential for our 3D reconstruction algorithm. Once a height is assigned to each pixel value, backward projection can be performed by shifting all pixels in each image to the object plane, according to the calculated shift ratio Si and the assigned height. Then, the pixels can be shifted again to the reference frame. Given a pixel location pi in image i with assigned height hp:
In other words, if a point that projects to location pi in image i were translated axially to the object plane, it would project to pixel location pi′. That same point on the object plane would project to location pref′ in the reference image, and the point at its actual height hp would project to pref in the reference image. Backward projection can then be performed by moving the value at every pixel location pi to its corresponding location in the reference coordinate frame, pref.
We perform this operation using Pytorch’s torch.nn.functional.grid_sample function to efficiently warp the images in a pixel-wise fashion. This function takes two inputs: (1) the image Ii to be warped and (2) a H×W×2 volume, where H and W are the height and width of image Ii in pixels, respectively. Each index in this volume must contain the normalized (px,py) pixel coordinates of the value in Ii, which should be warped to that index. Thus, to perform forward projection, we must start with a pixel location pref in the reference frame and specify the location pi in image Ii that warps to that pixel location, rather than starting with pi in image Ii and using Eqs. (16)–(18) to find pref. This requires an equation of the form pi=pref+D(hp,pref), where D is some function that can be efficiently computed without prior knowledge of pi. To accomplish this, prior to the start of reconstruction, we generate a dense map S^i, where S^i(pref′)=Si(pi′), and O^i, where O^i(pref′)=Oi(pi′) (more details are given in S2 in the Supplementary Material). With the assumption that O^i(pref′)≈O^i(pref) and S^i(pref′)≈S^i(pref), our full backward projection for a given pixel can then be given as
where href is a height map in the reference point of view. To backproject image Ii, the value at location pi=(pi,x,pi,y) is moved to location pref=(pref,x,pref,y), for every x in {1:H} and y in {1:W}. We will notate this as

I^i = B(Ii, href),

where I^i is image Ii backprojected to the reference frame using a height map href.
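A sketch of this warping step with torch.nn.functional.grid_sample is shown below. The dense maps S^i and O^i are assumed to be precomputed and resampled onto the reference pixel grid, and the sign conventions and exact combination of shift terms are simplifications of Eqs. (16)–(18) rather than the paper's exact expressions:

```python
import torch
import torch.nn.functional as F

def backproject_to_reference(img_i, height_ref, S_hat_i, O_hat_i):
    """Warp elemental image i into the reference view using a height map.

    img_i      : (1, 1, H, W) elemental image
    height_ref : (H, W) height map in the reference view (mm)
    S_hat_i    : (H, W, 2) dense shift-ratio map for camera i, resampled onto the
                 reference grid (pixels/mm, x and y components)
    O_hat_i    : (H, W, 2) dense inter-camera shift map for camera i (pixels)
    Sign conventions here are illustrative; they must match the calibration.
    """
    H, W = height_ref.shape
    # Pixel grid of the reference frame
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    # For each reference pixel, the location in image i that maps to it:
    # inter-camera offset plus the height-dependent parallax shift.
    px = xs + O_hat_i[..., 0] + S_hat_i[..., 0] * height_ref
    py = ys + O_hat_i[..., 1] + S_hat_i[..., 1] * height_ref
    # grid_sample expects normalized coordinates in [-1, 1], ordered (x, y)
    grid = torch.stack((2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1), dim=-1)
    return F.grid_sample(img_i, grid[None], mode="bilinear", align_corners=True)
```

Because the sampling grid depends on height_ref, the warped image is differentiable with respect to the height map, which is what allows the photometric loss in Appendix D to update the U-Net weights.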
Although forward projection is not needed for the 3D reconstruction algorithm used in this work, it could be done by performing the back projection step in reverse. Starting with an image and height map in the reference frame, we can generate 48 images by shifting the pixels according to the calculated inter-camera pixel shifts, then shifting the pixels in the 48 images again according to their height and the shift ratio at that point. Concretely,
or, using S′ and O′:
9. Appendix D: 3D Reconstruction
Although our calibration model can be applied to a wide range of 3D reconstruction algorithms, for this paper, we used a self-supervised approach adapted from MVS-Net, a previously published network for reconstructing height from MVS images.33–35,51 Although occurring on a different spatial scale (the MVS images depict macroscale objects, such as buildings or furniture), the FiLM-Scope images share many similarities with the MVS datasets. In both cases, an image set consists of dozens of images of a mostly opaque object, captured by calibrated cameras viewing from different angular perspectives within a limited solid cone. Thus, self-supervised networks designed for MVS images adapted well to our use case.
MVS-Net begins by using a single, shared convolutional neural network (CNN) to create multiple feature maps for each image and then back-projecting these maps through 3D space to form multiple feature volumes. These volumes are fed together into a 3D U-Net, which outputs a probability volume from which a height map can be extracted. The weights on both the shared CNN and the 3D U-Net are updated during training. The loss used varies between implementations, but it generally consists of the difference between an acquired image from a chosen perspective and another image warped to that perspective using the height map. These algorithms rely on the system calibration both to form the back-projected volume and to warp images between perspectives when computing the loss. We can use our custom calibration model to perform these steps.
In addition to our distinct calibration approach, our algorithm contains several key modifications to better suit our application space. First, we remove the feature extraction step with the CNN. This both speeds up reconstruction considerably, as a single volume is computed once at the beginning of reconstruction and repeatedly fed through the U-Net, and reduces the size of the 3D U-Net, as we only need to feed in one volume instead of multiple feature volumes. We also found this approach to provide more accurate results for our samples, likely because feature extraction removed rich photometric information from the images, which can assist in the 3D reconstruction of surfaces with few sharp features. Second, at this stage, we are not using MVS-Net in the traditional sense to perform training and inference; instead, we use the network as a deep image prior56 and start from a randomly initialized U-Net for each new sample. In this way, we can compute height maps for novel samples without the need for a prior training dataset. In the future, as more data are generated with the FiLM-Scope, we can adapt our approach to allow for faster 3D reconstruction on a pre-trained network. Finally, the architecture of the 3D U-Net was modified to better match our needed axial resolution (see S3.1 in the Supplementary Material).
A diagram for our algorithm is shown in Fig. 3. First, we generate a digital z-stack Z of size H×W×N by performing back projection for all C cameras (typically 48) to N distinct planes over a defined spatial range and summing the results for each camera, i.e.,

Zn = B(I1, hn) + B(I2, hn) + … + B(IC, hn),

where Zn is a single plane at index n within Z, and hn is a flat height map for the height at plane n. The range over which Z is computed is chosen manually, according to the height of the object (S3.1 in the Supplementary Material). If the height is not previously known, it can easily be determined by creating a preliminary virtual z-stack over a large range and identifying the planes where the top and bottom of the object are in focus. Ideally, the spacing between planes in the volume should roughly correspond to the axial resolution of the system. However, the algorithm can still produce fairly precise results if larger plane spacing is needed to conserve GPU space.
We also compute a squared volume Z² with planes

Z²n = B(I1, hn)² + B(I2, hn)² + … + B(IC, hn)².

Then, a variance volume is formed:

Zvar,n = Z²n/C − (Zn/C)².

Zvar is fed into a randomly initialized 3D U-Net, which outputs a probability volume P of size H×W×N. A voxel at location (px,py,n) in the volume contains the likelihood of the object that projects to point (px,py) in the reference image existing at the height of plane n, such that

Σn P(px,py,n) = 1.
From this volume, we can output a height map href by multiplying each plane in the volume by its corresponding height hn, and summing all the planes. If Pn is a plane in volume P, then

href = Σn hn·Pn.
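These volume and height-map computations can be sketched as follows, with the back-projection operator B(·) assumed available (e.g., built on the warping sketch in Appendix C) and the normalization of the variance volume chosen as one reasonable option:

```python
import torch

def build_variance_volume(images, plane_heights, backproject):
    """Form a variance volume Z_var to feed to the 3D U-Net.

    images        : (C, H, W) elemental images
    plane_heights : list of N plane heights (mm)
    backproject   : callable(image, flat_height_map) -> (H, W) image warped to
                    the reference view at that plane (the B(.) operator)
    """
    C, H, W = images.shape
    planes, planes_sq = [], []
    for h in plane_heights:
        flat = torch.full((H, W), float(h))
        warped = torch.stack([backproject(img, flat) for img in images])  # (C, H, W)
        planes.append(warped.sum(0))             # Z_n
        planes_sq.append((warped ** 2).sum(0))   # Z^2_n
    Z = torch.stack(planes)                      # (N, H, W)
    Z_sq = torch.stack(planes_sq)
    return Z_sq / C - (Z / C) ** 2               # per-voxel variance across cameras

def height_from_probability(prob_volume, plane_heights):
    """Expected-height map from the probability volume (sums to 1 along z)."""
    h = torch.as_tensor(plane_heights, dtype=prob_volume.dtype)
    return (prob_volume * h[:, None, None]).sum(0)     # (H, W)
```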
This height map is used to back-project all C−1 nonreference images to the reference view, creating a set of “rectified” images, I^i=B(Ii,href).
A structural similarity loss between the rectified images and the reference image, along with an image-aware depth smoothing loss (Lsm), is used to update the weights on the U-Net. Both of these losses are commonly used in 3D height estimation algorithms and are detailed in Ref. 57:

L = λ1·LSSIM + λ2·Lsm.
Using this simple loss function allowed us to perform height reconstruction with minimal hyper-parameter tuning. λ2 and λ1 were held constant for samples shown in this work, and we expect this generalizability to hold over a large number of opaque sample types. In addition, when reconstructing height for videos with multiple frames, the network only needs to be trained from scratch on the first frame, and subsequent frames can use the previous frame’s network as a starting point, drastically speeding up reconstruction times.
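For reference, the image-aware depth smoothing term can be sketched in its standard edge-aware form from the self-supervised depth literature; the exact formulation and weighting used in this work follow Ref. 57:

```python
import torch

def edge_aware_smoothness(height, ref_image):
    """Image-aware depth smoothness: penalize height gradients, down-weighted
    where the reference image itself has strong gradients (likely true edges)."""
    dh_x = (height[:, 1:] - height[:, :-1]).abs()
    dh_y = (height[1:, :] - height[:-1, :]).abs()
    di_x = (ref_image[:, 1:] - ref_image[:, :-1]).abs()
    di_y = (ref_image[1:, :] - ref_image[:-1, :]).abs()
    return (dh_x * torch.exp(-di_x)).mean() + (dh_y * torch.exp(-di_y)).mean()
```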
For this paper, the algorithm was run on Nvidia GPUs with up to 48 GB of RAM. This memory was adequate to perform reconstructions over the full FOV in a single run with 4×4 downsampled images, but full-resolution images currently need to be reconstructed in patches. The number of patches needed is sample dependent as samples with larger height variations require a larger number of planes N in the variance volume Zvar, which quickly consumes RAM.
To improve the patching process, we can begin by computing a low-resolution height map href,low and then up-sample it to match the size of the full-resolution map. We then use this upsampled map, which we will notate as hguide, to guide the high-resolution patching process. With this approach, rather than creating Z and Z² using flat height maps, each plane is a back projection to a curved surface offset from hguide. For instance, to create a volume with N planes spaced d mm apart, the height map to warp to plane n would be given by

hn = hguide + (n − (N−1)/2)·d.
Taking this approach allows us to greatly reduce the number of patches needed for samples with large height variation, as we only need to create a small volume directly around the low-resolution height map, rather than searching over a large height range. More information is given in S3.3 in the Supplementary Material.
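A sketch of this guided plane construction is given below, assuming the N surfaces are centered on the upsampled guide map (the exact offset convention is an assumption):

```python
import torch

def guided_plane_heights(h_guide, num_planes, spacing_mm):
    """Curved 'planes' offset from an upsampled low-resolution height map.

    h_guide    : (H, W) upsampled low-resolution height map (mm)
    num_planes : N, number of surfaces in the guided volume
    spacing_mm : d, spacing between surfaces (mm)
    Returns an (N, H, W) stack of height maps to back-project to.
    """
    offsets = (torch.arange(num_planes, dtype=h_guide.dtype)
               - (num_planes - 1) / 2) * spacing_mm
    return h_guide[None] + offsets[:, None, None]
```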
As an additional benefit of our reconstruction algorithm, we can reduce noise in the reference view by summing all rectified images I^i, which are already generated during the course of reconstruction. Examples of these summed images compared with the raw images from a single view are shown in S3.4 in the Supplementary Material.