Understanding a visual scene requires not only identifying individual objects in isolation but also inferring the relationships and interactions between object pairs. In this study, we propose a novel Transformer-based scene graph generation framework that converts image data into a structured linguistic description, represented as the nodes and edges of a graph, of the information in a given image. The proposed model consists of three components. First, we propose an enhanced object detection module that uses a bidirectional long short-term memory (Bi-LSTM) network for object-to-object information exchange and predicts bounding boxes and classification probabilities for objects. Second, we introduce a novel context-capture module built from Transformer layers that produces context-aware object category representations as well as context-aware edge representations for specific object pairs. Finally, because relationship frequencies follow a long-tailed distribution, we design an adaptive inference module with a dedicated feature fusion strategy that softens the predicted distribution and adaptively classifies relationships based on the visual appearance of each object pair. We conduct extensive experiments on three popular open-source datasets, namely Visual Genome, OpenImages, and Visual Relationship Detection, together with ablation studies on each module; the results demonstrate significant improvements across different settings and evaluation metrics.
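
To make the three-stage pipeline concrete, the following PyTorch sketch illustrates the data flow only: Bi-LSTM object-context exchange, Transformer-based context capture for pairwise edge features, and a softened predicate distribution. It is not the paper's implementation; all names (`SceneGraphPipeline`, `feat_dim`, `pair_idx`) are hypothetical, and temperature scaling stands in for the unspecified feature fusion and distribution-softening strategy.

```python
import torch
import torch.nn as nn

class SceneGraphPipeline(nn.Module):
    """Minimal sketch of the three components described in the abstract."""

    def __init__(self, feat_dim=256, num_classes=151, num_predicates=51):
        super().__init__()
        # (1) Object-to-object information exchange over detector RoI features
        #     via a bidirectional LSTM, followed by an object classifier.
        self.obj_context = nn.LSTM(feat_dim, feat_dim // 2,
                                   bidirectional=True, batch_first=True)
        self.obj_cls = nn.Linear(feat_dim, num_classes)
        # (2) Transformer layers that refine object features with global
        #     context and feed pairwise (subject, object) edge features.
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                                   batch_first=True)
        self.context_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.edge_proj = nn.Linear(2 * feat_dim, feat_dim)
        # (3) Adaptive inference head over fused pair features.
        self.rel_cls = nn.Linear(feat_dim, num_predicates)

    def forward(self, roi_feats, pair_idx, temperature=1.5):
        # roi_feats: (B, N, D) pooled detector features for N object proposals
        # pair_idx:  (P, 2) indices of candidate (subject, object) pairs
        ctx, _ = self.obj_context(roi_feats)           # Bi-LSTM exchange
        obj_logits = self.obj_cls(ctx)                 # object class scores
        ctx = self.context_encoder(ctx)                # Transformer context
        subj = ctx[:, pair_idx[:, 0]]                  # subject features
        obj = ctx[:, pair_idx[:, 1]]                   # object features
        edge = self.edge_proj(torch.cat([subj, obj], dim=-1))
        # Dividing logits by a temperature > 1 softens the predicate
        # distribution, a simple proxy for the paper's strategy.
        rel_logits = self.rel_cls(edge) / temperature
        return obj_logits, rel_logits


# Usage with random inputs: 4 proposals, 3 candidate pairs.
model = SceneGraphPipeline()
feats = torch.randn(1, 4, 256)
pairs = torch.tensor([[0, 1], [1, 2], [2, 3]])
obj_logits, rel_logits = model(feats, pairs)
print(obj_logits.shape, rel_logits.shape)  # (1, 4, 151), (1, 3, 51)
```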