Who, What and Where (3W) are the three core elements of storytelling, and accurately identifying the 3W semantics is critical to understanding the story in a video. This paper studies the 3W composite-semantics video Instance Search (INS) problem, which aims to find video shots of a specific person performing a concrete action in a particular location. Popular Complete-Decomposition (CD) methods divide a composite-semantics query into multiple single-semantics queries, which tend to yield inaccurate or incomplete retrieval results because they neglect important semantic correlations. Recent Non-Decomposition (ND) methods utilize Vision-Language Models (VLMs) to directly measure the similarity between the textual query and video content. However, their accuracy is limited by VLMs' immature capability to recognize fine-grained objects. To address the above challenges, we propose a video structure-aware Partial-Decomposition (PD) method. Its core idea is to partially decompose the 3W INS problem into three semantically correlated 2W INS problems, i.e., person-action INS, action-location INS, and location-person INS. We then model the correlations between each pair of semantics at the frame, shot, and scene levels of story videos. By exploiting the spatial consistency and temporal continuity inherent in the hierarchical structure of story videos, we obtain identity-matching, logic-consistent, and content-coherent 3W INS results. To validate the effectiveness of the proposed method, we build three large-scale 3W INS datasets based on the TV series Eastenders, Friends, and The Big Bang Theory, comprising over 670K video shots and spanning 700 hours in total. Extensive experiments show that the proposed PD method surpasses the current state-of-the-art CD and ND methods for 3W INS in story videos.