Monitoring and controlling pest insects is crucial to ensure safe and profitable crop growth in all plantation types, as well as to guarantee food quality and limit the use of pesticides. We aim to extend traditional trap-based monitoring by involving the general public, who can report the presence of insects using smartphones. This includes the largely unexplored problem of detecting insects in images taken in uncontrolled conditions. Furthermore, pest insects are in many cases extremely similar to harmless species, so computer vision algorithms must not be fooled by these look-alikes if they are to avoid raising false alarms. In this work, we study the capabilities of state-of-the-art (SoA) object detection models based on convolutional neural networks (CNNs) for the task of detecting beetle-like pest insects in non-homogeneous images taken outdoors by different sources. Moreover, we focus on disambiguating a pest insect from similar harmless species. We consider not only the detection performance of the different models, but also the computational resources they require. This study aims to provide a baseline model for this kind of task. Our results show that current SoA models are suitable for this application, and highlight FasterRCNN with a MobileNetV3 backbone as a particularly good starting point in terms of accuracy and inference latency. This combination achieved a mean average precision of 92.66%, which can be considered qualitatively at least as good as the scores obtained by other authors who adopted more specific models.
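For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean average precision score can be computed for a detector that separates a pest class from a harmless look-alike class. It is not the evaluation code used in this study: the IoU threshold of 0.5, the all-point interpolation of the precision-recall curve, the function names (`iou`, `average_precision`, `mean_average_precision`) and the toy boxes are all assumptions made for illustration.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_thr=0.5):
    """Average precision for one class.

    detections:    list of (image_id, score, box)
    ground_truths: dict mapping image_id -> list of boxes
    iou_thr:       assumed matching threshold (0.5 here, not taken from the paper)
    """
    n_gt = sum(len(boxes) for boxes in ground_truths.values())
    if n_gt == 0:
        return 0.0

    matched = defaultdict(set)  # ground-truth indices already claimed, per image
    hits = []
    # Visit detections from most to least confident.
    for image_id, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truths.get(image_id, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # True positive only if the best-overlapping box is still unclaimed.
        is_tp = best_iou >= iou_thr and best_idx not in matched[image_id]
        if is_tp:
            matched[image_id].add(best_idx)
        hits.append(is_tp)

    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)

    # Make precision non-increasing in recall (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class, iou_thr=0.5):
    """Mean of per-class APs; per_class maps name -> (detections, ground_truths)."""
    aps = [average_precision(d, g, iou_thr) for d, g in per_class.values()]
    return sum(aps) / len(aps)


# Toy data (illustrative only): one pest class and one harmless look-alike class.
toy = {
    "pest": (
        [("a", 0.9, (0, 0, 10, 10)), ("a", 0.8, (20, 20, 30, 30)), ("b", 0.7, (1, 1, 10, 10))],
        {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]},
    ),
    "harmless": (
        [("a", 0.6, (50, 50, 60, 60))],
        {"a": [(50, 50, 60, 60)]},
    ),
}
toy_map = mean_average_precision(toy)
```

In this sketch, a confident detection placed on the wrong object lowers the precision of every lower-ranked detection in the same class, which is why a single mAP figure also reflects how well a model avoids false alarms on look-alike insects.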