Multi-instance object tracking is an active research problem in computer vision. Most novel methods analyze and locate targets in videos taken from static camera set-ups, which have proved efficient and effective for many established monitoring systems worldwide, such as those for animal behavior studies and for human and road traffic. However, despite the growing success of computer vision in animal monitoring and behavior analysis, no such system has yet been developed for free-ranging Japanese macaques. Our study therefore aims to establish a tracking system for Japanese macaques in their natural habitat. We begin by training a monkey detector using You Only Look Once (YOLOv4) and investigating the effects of different transfer learning techniques, curriculum learning, and dataset heterogeneity on the model’s accuracy. Using the box detections from our monkey detection model, we then apply SuperGlue and Murty’s algorithm to re-identify individual monkeys across succeeding frames. Our best-performing Japanese macaque detection model, a YOLOv4 architecture with a spatial attention module and the Mish activation function trained with a three-stage curriculum, achieved a mean $AP^{50}$ of 96.59%, a precision of 93%, a recall of 96%, and a mean $IOU_{AP@50}$ of 77.2%. With a MOTA of 91.35% even on our heterogeneous dataset, our tracking system can prove effective and reliable for animal behavior studies.
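
For readers unfamiliar with the reported metrics, the following is a minimal sketch of the standard definitions of box intersection-over-union (IoU) and multiple object tracking accuracy (MOTA). It is not the evaluation code used in this study; the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the example counts are assumptions made for illustration only.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


if __name__ == "__main__":
    # Hypothetical values, not taken from the study
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(mota(false_negatives=5, false_positives=3,
               id_switches=2, num_ground_truth=100))
```

Under the $AP^{50}$ convention, a detection counts as a true positive when its IoU with a ground-truth box is at least 0.5. MOTA penalizes missed targets, false detections, and identity switches together, so it can be negative when the errors outnumber the ground-truth objects.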