This is a well-known problem in computer vision. There are various papers you can refer to, including work on simultaneous localisation and mapping (SLAM), which may use either bundle adjustment or filter-based tracking. Reading up on popular papers on these topics will give you a lot of insight into cameras and tracking in the real world.

To summarise, you will need to obtain the 6D pose of the camera in every frame, i.e. you need to figure out where the camera is in the real world (translation) and where it is pointing (rotation). This is usually done by first tracking salient features in the scene, estimating their 3D positions, and then using the perceived motion of those features to recover the camera pose in every frame. You will need to define an origin in the real world (you cannot use the camera as the origin for the problem you're trying to solve) and have at least 4 known/measured points as a reference to start with. In the video you've included in your question, Augment seem to use a printed pattern to get the initial camera pose; they then track features in the real world to continue tracking the pose.
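For instance, once you have four measured reference points and calibrated camera intrinsics, the per-frame pose can be recovered by solving the perspective-n-point (PnP) problem. Here is a minimal sketch using OpenCV's cv2.solvePnP; the point coordinates and intrinsics below are made-up placeholder values, not taken from the video:

```python
import numpy as np
import cv2

# Four known/measured reference points in world coordinates (metres).
# These values are hypothetical placeholders.
object_points = np.array([
    [0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0],
    [0.2, 0.2, 0.0],
    [0.0, 0.2, 0.0],
], dtype=np.float64)

# Their detected pixel locations in the current frame (also placeholders).
image_points = np.array([
    [320.0, 240.0],
    [420.0, 238.0],
    [424.0, 340.0],
    [318.0, 342.0],
], dtype=np.float64)

# Intrinsics from a prior camera calibration (fx, fy, cx, cy assumed here).
camera_matrix = np.array([
    [800.0,   0.0, 320.0],
    [  0.0, 800.0, 240.0],
    [  0.0,   0.0,   1.0],
])
dist_coeffs = np.zeros(5)  # assume negligible lens distortion

# Solve the PnP problem for the 6D pose (rotation + translation).
ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)

# rvec/tvec map world points into the camera frame; invert the transform
# to get the camera's position in world coordinates.
R, _ = cv2.Rodrigues(rvec)
camera_position = -R.T @ tvec
print("camera position (world):", camera_position.ravel())
```

In a real system you would replace the hand-entered image points with detections of the printed pattern or the tracked scene features, and re-run the solve (or a bundle-adjusted/filtered variant of it) for every frame.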
Google Cloud Vision API: Node.js and an image URI, how to invoke vision.detectText()?
I think the issue is the following: you need to upgrade your torchvision package, as VisionDataset was introduced in torchvision 0.3.0 in PR#749 as a base class for all datasets. Check the release notes.
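A quick way to confirm and fix this, as a sketch (assumes pip manages your environment; use your own package manager's equivalent otherwise):

```python
# Check the installed torchvision version; VisionDataset needs >= 0.3.0.
import torchvision
print(torchvision.__version__)

# On versions older than 0.3.0 this import fails with an ImportError:
from torchvision.datasets.vision import VisionDataset

# If it fails, upgrade from a shell and re-run the check:
#   pip install --upgrade torchvision
```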