dc.description.abstract |
A Robust Multi-Camera Deep Person Re-Identification
Framework Using Spatiotemporal Context Modelling
This thesis presents a framework for the robust automatic re-identification of a person across multiple non-overlapping cameras under variable and dynamic environmental conditions, enabling accurate re-identification and retrieval of targeted person identities. Person Re-Identification (ReID) aims to identify a query person of interest (POI), assigned a unique identity label, across multiple non-overlapping cameras; the query POI can be either an image or a video sequence. Person ReID has gained increasing attention from research and developer communities in recent years. Occlusion, viewpoint variation, misalignment, unconstrained poses, and background clutter are among the major challenges in developing robust, lightweight, end-to-end trainable person ReID models.
To address these issues, this research presents an attention mechanism built on part/region-aggregated local feature representation learning, incorporating long-range local and global context modeling. The part-aware local attention blocks are aggregated into a modified, pre-trained ResNet50 CNN backbone by employing two attention blocks, i.e., a Spatio-Temporal Attention Module (STAM) and a Channel Attention Module (CAM), thus improving both local and global feature representation learning. The spatial attention block of STAM learns contextual dependencies between different human body-part regions, such as the head, upper body, lower body, and shoes, within a single frame. The temporal attention modality, in turn, learns temporal contextual dependencies of the same person's body parts across all video frames, as sketched below.
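The abstract gives no implementation detail for STAM; the following is a minimal PyTorch sketch of how such a spatio-temporal attention block might be realized, assuming a (batch, frames, channels, height, width) tracklet input. The class name, layer shapes, and the choice of nn.MultiheadAttention for the temporal branch are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of a spatio-temporal attention module in the spirit of STAM.
# All structural choices below are assumptions made for illustration.
import torch
import torch.nn as nn


class STAM(nn.Module):
    """Spatial attention within each frame, then temporal attention across frames."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs produce query/key/value maps for spatial self-attention.
        self.q = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.k = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        # Temporal attention operates on per-frame descriptors.
        self.temporal = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -- a batch of T-frame tracklets.
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)

        # Spatial attention: dependencies between body-part regions in one frame.
        q = self.q(frames).flatten(2).transpose(1, 2)   # (BT, HW, C/8)
        k = self.k(frames).flatten(2)                   # (BT, C/8, HW)
        v = self.v(frames).flatten(2)                   # (BT, C, HW)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (BT, HW, HW)
        spatial = (v @ attn.transpose(1, 2)).reshape(b * t, c, h, w)
        frames = frames + self.gamma * spatial          # residual connection

        # Temporal attention: relate the same person's parts across the T frames.
        desc = frames.reshape(b, t, c, h * w).mean(-1)  # (B, T, C) frame descriptors
        desc, _ = self.temporal(desc, desc, desc)       # (B, T, C)
        # Re-weight each frame's feature map by its attended descriptor.
        return frames.reshape(b, t, c, h, w) * desc.sigmoid()[..., None, None]
```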
Lastly, the channel-based attention modality, i.e., CAM, can model semantic connections between the channels of the feature maps. The STAM and CAM blocks are combined sequentially to form a unified attention network named the Spatio-Temporal Channel Attention Network (STCANet), able to learn both short-range and long-range feature dependencies; the sketch below illustrates this composition.
Extensive experiments are carried out to study the effectiveness of STCANet on three image-based and two video-based benchmark datasets, i.e., Market-1501, DukeMTMC-ReID, MSMT17, DukeMTMC-VideoReID, and MARS. K-reciprocal re-ranking of the gallery set is also applied; a simplified sketch of this step follows.
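K-reciprocal re-ranking is a published post-processing step (Zhong et al., CVPR 2017): two samples are treated as strong matches when each appears in the other's top-k neighbor list, and a Jaccard distance over these neighbor sets is blended with the original distance. The NumPy sketch below condenses that idea, omitting the paper's local query expansion and sparse-vector optimizations; the function name and the default k and lambda values are arbitrary.

```python
# Condensed sketch of k-reciprocal re-ranking (after Zhong et al., CVPR 2017).
# Quadratic loops for clarity; real implementations vectorize heavily.
import numpy as np


def k_reciprocal_rerank(dist: np.ndarray, k: int = 20, lam: float = 0.3) -> np.ndarray:
    """dist: (N, N) pairwise distances over the combined query+gallery features.
    Assumes dist[i, i] == 0, so each sample is its own nearest neighbor.
    Returns a re-ranked distance matrix blending Jaccard and original distance."""
    n = dist.shape[0]
    ranks = np.argsort(dist, axis=1)           # nearest neighbors per sample

    # k-reciprocal neighbor sets: keep j only if i is also in j's top-k.
    neighbors = []
    for i in range(n):
        top_k = ranks[i, : k + 1]
        recip = {j for j in top_k if i in ranks[j, : k + 1]}
        neighbors.append(recip)

    # Jaccard distance between the neighbor sets.
    jaccard = np.zeros_like(dist)
    for i in range(n):
        for j in range(n):
            inter = len(neighbors[i] & neighbors[j])
            union = len(neighbors[i] | neighbors[j])
            jaccard[i, j] = 1.0 - inter / union

    # Final distance: weighted blend of Jaccard and original distances.
    return (1 - lam) * jaccard + lam * dist
```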
With re-ranking, the proposed network shows significant improvement over the state of the art on these datasets, achieving (mAP/Rank-1) scores of (95.5/94.5), (90.7/92.3), and (74.4/84.5) on the Market-1501, DukeMTMC-ReID, and MSMT17 datasets, respectively. In addition, the proposed modified STCANet also shows significant performance improvement in comparison to state-of-the-art methods, achieving (mAP/Rank-1) scores of (96.6/97.1) and (85.3.7/89.1) on the DukeMTMC-VideoReID and MARS datasets, respectively. Lastly, to study the generalizability of STCANet on unseen test instances, cross-validation on external cohorts is also performed, demonstrating the robustness of the proposed model. The proposed STCANet is lightweight, end-to-end trainable, and can be readily deployed in real-world practical applications. |
en_US |