We introduce a novel framework for 3D scene reconstruction with simultaneous object annotation, using a pre-trained 2D convolutional neural network (CNN), incremental data streaming, and remote exploration with a virtual reality setup. It enables versatile integration of any 2D box detection or segmentation network. We integrate new approaches to (i) asynchronously perform dense 3D reconstruction and object annotation at interactive frame rates, (ii) efficiently optimize CNN results in terms of object prediction and spatial accuracy, and (iii) generate computationally efficient colliders in large triangulated 3D reconstructions at run-time for 3D scene interaction. Our method is novel in combining CNNs that have long and varying inference times with live 3D reconstruction from RGB-D camera input. We further propose a lightweight data structure that stores the 3D reconstruction data and object annotations to enable fast incremental data transmission for real-time exploration with a remote client, which has not been presented before. Our framework achieves update rates of 22 fps (SSD MobileNet) and 19 fps (Mask R-CNN) for indoor environments of up to 800 m³. We evaluated the accuracy of 3D object detection. Our work provides a versatile foundation for semantic scene understanding of large streamed 3D reconstructions, while being independent of the CNN's processing time. Source code is available for non-commercial use.
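The decoupling of reconstruction from CNN inference described above can be illustrated with a minimal, single-threaded simulation. This is a sketch under stated assumptions, not the framework's actual implementation: the names `Pipeline`, `PendingJob`, `run`, and `detector_latency` are hypothetical, inference time is modeled as a fixed number of frames, and the label string stands in for real detection output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingJob:
    frame_id: int   # frame the (hypothetical) detector was started on
    ready_at: int   # frame index at which its result becomes available


@dataclass
class Pipeline:
    detector_latency: int  # assumed inference time, measured in frames
    integrated: List[int] = field(default_factory=list)
    annotations: Dict[int, str] = field(default_factory=dict)
    job: Optional[PendingJob] = None

    def step(self, frame_id: int) -> None:
        # Reconstruction never waits for the detector: every frame is integrated.
        self.integrated.append(frame_id)

        # Collect a finished detection and attach it to the frame it was computed on.
        if self.job is not None and frame_id >= self.job.ready_at:
            self.annotations[self.job.frame_id] = f"labels@{self.job.frame_id}"
            self.job = None

        # If the detector is idle, hand it the newest frame.
        # Frames that arrive while it is busy are integrated but not annotated.
        if self.job is None:
            self.job = PendingJob(frame_id, frame_id + self.detector_latency)


def run(num_frames: int, detector_latency: int) -> Pipeline:
    pipeline = Pipeline(detector_latency)
    for frame_id in range(num_frames):
        pipeline.step(frame_id)
    return pipeline


if __name__ == "__main__":
    result = run(num_frames=10, detector_latency=3)
    print("integrated frames:", result.integrated)
    print("annotated frames: ", sorted(result.annotations))
```

In this sketch the reconstruction update rate does not depend on `detector_latency`; a slower detector only reduces how many frames receive annotations.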