Wu, Patrick Y., and Walter R. Mebane. “MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks.” Computational Communication Research 4, no. 1 (May 3, 2022): 275–322. Accessed June 10, 2025. http://bubble.labs.vu.nl/ccr/article/view/102.