Improving Open-set and Closed-set Object Detection of Child Violence with Synthetic Data
Read Full Paper Here
Photo by RDNE Stock project: https://www.pexels.com/photo/woman-s-palm-with-written-message-6003790/
Child violence is a critical issue with underreported cases and widespread online distribution.
According to the World Health Organization, 1 in 2 children aged 2-17 years across the world suffer some form of violence each year.
In the Americas alone, an estimated 58% of children in Latin America and 61% in North America report having experienced physical, sexual, and/or emotional abuse in the past year.
The need for computer vision models that can detect images of violence against children and Child Sexual Abuse Material (CSAM) is both critical and urgent.
The current approach used in content moderation is hash matching, which requires extensive human labor to manually classify content and update hashes. This time-consuming labor has been known to induce traumatic mental health outcomes for human moderators.
With the tremendous rise of AI-generated deepfakes, 96% of which constitute explicit sexual material, content depicting violence against children is already proliferating.
What if we fight fire with fire by leveraging AI-generated data and simulated violence against children before such media gets distributed?
In our project, we investigate how much image classification of violence against children can be improved by leveraging synthetic data. Since we do not have access to real CSAM, we use a "proxy" dataset of real children involved in non-sexual, physical violence.
The dataset consists of just 500 images, which, as we describe later, were difficult to obtain and required scouring foreign websites.
We frame CSAM detection as an object detection task followed by classification: given an image, we use an object detector to first output bounding boxes around people engaged in violence, and then classify whether any of those people are children. For object detection, we assess two models: GroundingDINO, a multimodal open-set model that supports zero-shot detection using natural language, and YOLOv5, a closed-set model that can detect only the categories specified in its training dataset.
For synthetic data, we compare two generation methods: Stable Diffusion, a multimodal text-to-image generator based on latent diffusion models, and a custom pipeline we developed that renders images of violence in Unity, a game engine, and then converts them to photo-realistic images using CycleGAN, a generative adversarial network designed for domain transfer.
Result Summary
Baseline results
Confusion matrix
Another important aspect of our research was exploring the feasibility of synthetic data generation for this task. Adding images generated with Stable Diffusion v1.4 to our dataset slightly improved performance for both YOLOv5 and GroundingDINO, while adding the Unity images slightly reduced the performance of YOLOv5. However, adding both sets of synthetic images together produced a larger improvement than the Stable Diffusion images alone, highlighting the benefit of scaling data.
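The training-set variants compared above (real only, real plus Stable Diffusion, real plus Unity, real plus both) can be sketched as a small manifest builder. This is a minimal illustration, not our actual data-loading code: the function name, folder names, and image counts are hypothetical, and it assumes synthetic images are mixed into the training split only.

```python
import random


def build_training_set(real, synthetic_sources, include=(), seed=0):
    """Return a shuffled list of (path, source) pairs.

    real: image paths from the real training split.
    synthetic_sources: dict mapping a source name to a list of image paths.
    include: names of the synthetic sources to mix in (assumed keys of the dict).
    """
    items = [(path, "real") for path in real]
    for name in include:
        items.extend((path, name) for path in synthetic_sources[name])
    # A fixed seed keeps the shuffled order reproducible across runs.
    random.Random(seed).shuffle(items)
    return items


# Hypothetical file lists; the counts are placeholders, not the study's numbers.
real_train = [f"real/{i:04d}.jpg" for i in range(8)]
synthetic = {
    "stable_diffusion": [f"sd/{i:04d}.png" for i in range(4)],
    "unity_cyclegan": [f"unity/{i:04d}.png" for i in range(4)],
}

variants = {
    "real_only": build_training_set(real_train, synthetic),
    "real_plus_sd": build_training_set(real_train, synthetic, ["stable_diffusion"]),
    "real_plus_unity": build_training_set(real_train, synthetic, ["unity_cyclegan"]),
    "real_plus_both": build_training_set(
        real_train, synthetic, ["stable_diffusion", "unity_cyclegan"]
    ),
}

for name, items in variants.items():
    print(name, len(items))
```

Keeping the source tag next to each path makes it easy to check later how many synthetic images of each kind ended up in a given training run.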
Although violence detection is inherently time-based, we chose to work with images due to data size, model complexity, and training time constraints on Google Colab’s GPU. In future work, we plan to explore temporal information using 3D CNNs and RNNs to enhance accuracy by capturing dynamics across frames. Additionally, the size of our dataset was severely limited by the fact that we had to compile it ourselves. With more people and compute, we could have assembled a larger synthetic dataset manifesting a greater diversity of child violence, which might have enabled our models to generalize better to unseen cases.
For more detailed information and explanation, please read the full paper from the link below.
As for results, in our investigation of improving CSAM detection with synthetic data, we examined two methods for detecting CSAM by proxy. One method was a fusion of two models, where we used a fine-tuned ResNet to detect images of violence in our dataset and then had YOLOv5 detect whether there were any children in these images. In our results, this approach provided better F1 scores than using a YOLOv5 model alone, as the model seemed to focus more on the concrete task of detecting children rather than trying to detect child violence, which is more nuanced. Our second method was to fine-tune a state-of-the-art multimodal, open-set object detection model, GroundingDINO, and to evaluate it on our child violence test set.
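Since the comparison above is stated in terms of F1, here is a minimal sketch of how precision, recall, and F1 follow from confusion-matrix counts. The counts in the usage line are made up for illustration and are not results from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Zero denominators are mapped to 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a model with high precision but low recall (or the reverse) scores lower than one that balances the two, which matters when missed detections are costly.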
We found that GroundingDINO performed worse than the ResNet-YOLOv5 fusion approach in detecting child violence. This may be due to the complexity of the child violence detection task across different modalities, or the linguistic ambiguity of the textual prompts used. On the other hand, the ResNet-YOLOv5 fusion method benefits from the distinct separation of tasks, which is in some ways cleaner than the "two-stage" dual-prompt process that GroundingDINO employs.
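The separation of tasks in the ResNet-YOLOv5 fusion can be sketched as a simple decision rule over the two models' outputs. This is a hedged illustration rather than our actual implementation: the `Detection` type, the `fuse` function, the `"child"` label name, and both thresholds are assumptions, and the inputs stand in for a violence classifier's score and an object detector's labeled boxes.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detector output: a class label and its confidence (box omitted)."""
    label: str
    confidence: float


def fuse(violence_score, detections, violence_threshold=0.5,
         child_threshold=0.5, child_label="child"):
    """Flag an image only if both stages agree.

    Stage 1: the image-level violence score must reach violence_threshold.
    Stage 2: at least one detection with the child label must reach
    child_threshold. All names and thresholds here are hypothetical.
    """
    if violence_score < violence_threshold:
        return False
    return any(
        d.label == child_label and d.confidence >= child_threshold
        for d in detections
    )


# Stand-in outputs; a real pipeline would obtain these from the two models.
example = [Detection("adult", 0.91), Detection("child", 0.77)]
print(fuse(0.84, example))
```

Each stage answers one concrete question, so thresholds can be tuned separately for the violence classifier and the child detector.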