Hi everyone, my name is Kexin Yi. I'm a PhD student at Harvard University. Today I am presenting our recent work on a new dataset called CLEVRER, a video dataset for temporal and causal reasoning. This work is a collaboration with MIT, IBM, and DeepMind. When humans watch the video shown here, not only can we answer questions describing the events that happen, but we can also reason about the causal structure behind those events, make predictions about the future, and imagine unobserved counterfactual scenarios. This naturally leads to the question: can machines perform causal reasoning tasks the way humans can? Recently, there have been many efforts to build datasets for visual question answering on both static images and videos. Despite this great progress, these datasets are either biased and noisy, require heavy human annotation, or lack complex causal and temporal context. In this work, we introduce a new dataset called CLEVRER to facilitate research on temporal and causal reasoning in videos. The dataset consists of synthetic videos and questions with well-controlled biases. It also provides ground-truth annotations of the video states for model diagnostics. The questions cover four types of tasks focused on causal reasoning: descriptive, explanatory, predictive, and counterfactual. Here you can see an example video from the dataset. Among the four question types, descriptive questions ask the model to describe an event that happened during the video. Explanatory questions ask the model to explain the cause of a specified event by choosing another event or object that is responsible for it. In this example, the collision between the rubber sphere and the cyan cylinder pushes the cylinder into another collision with the metal cylinder. Therefore, we regard the second collision as being caused by the first.
Predictive questions query future events after the video ends. In this example, the metal sphere will keep moving along the white arrow and collide with the red sphere. Counterfactual questions require the model to imagine the outcome when an object is removed from the scene and to select the event that would happen under that unobserved scenario. We evaluate a wide range of baseline models, including the current state-of-the-art models on video QA datasets. While the MAC network achieves more than 85 percent accuracy on the descriptive questions, all models perform poorly on the remaining explanatory, predictive, and counterfactual tasks. Our evaluation results suggest these causal tasks require the model to capture the dynamics of the video's internal state and be able to make future predictions. Following this observation, we designed an oracle model called Neuro-Symbolic Dynamic Reasoning (NS-DR). Our model first parses each video frame into an object-centric abstract representation that includes the position and the static attributes, such as color and shape, of each object. We then feed this representation to a dynamics model, a graph neural network, which predicts the object trajectories across time steps as well as the collision events throughout the video. The input question is parsed into a program representing its underlying logic by a sequence-to-sequence network. Finally, the program is executed by a symbolic executor on the predicted motion trajectories to output the answer. Let's look at an example of how NS-DR performs counterfactual reasoning. Here, the upper window shows the input video and the lower window shows the predicted trajectories when the cyan cylinder is removed. Our model is able to capture the counterfactual event without direct observation and answer the question correctly.
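The pipeline just described, parsing each frame into object-centric states, rolling out a dynamics model to predict trajectories and collisions, and executing the question on the predictions, can be sketched in miniature. Everything below is an illustrative assumption (1-D positions, hand-written equal-mass elastic collisions), not the authors' actual implementation, which uses a neural frame parser, a graph-network dynamics model, and a seq2seq question parser:

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass
class Obj:
    name: str    # stands in for the parsed static attributes (color, shape)
    pos: float   # 1-D position; the real parser gives frame coordinates
    vel: float   # velocity, estimated from consecutive frames

def rollout(objs: List[Obj], steps: int) -> List[Tuple[str, str]]:
    """Stand-in dynamics model: advance positions and record collision
    events; equal-mass elastic collisions swap the two velocities."""
    objs = [Obj(o.name, o.pos, o.vel) for o in objs]  # don't mutate input
    events: List[Tuple[str, str]] = []
    for _ in range(steps):
        for o in objs:
            o.pos += o.vel
        for a, b in combinations(objs, 2):
            if abs(a.pos - b.pos) < 0.5:          # "collision" threshold
                if (a.name, b.name) not in events:
                    events.append((a.name, b.name))
                a.vel, b.vel = b.vel, a.vel
    return events

def counterfactual_events(objs: List[Obj], removed: str, steps: int):
    """Executor step for a counterfactual question: drop one object
    from the parsed scene, then re-run the dynamics model."""
    return rollout([o for o in objs if o.name != removed], steps)

scene = [
    Obj("metal_sphere", pos=0.0, vel=1.0),
    Obj("cyan_cylinder", pos=3.0, vel=0.0),
    Obj("red_sphere", pos=10.0, vel=0.0),
]

# Observed rollout: the sphere hits the cylinder, which then hits the
# red sphere.
print(rollout(scene, steps=12))
# [('metal_sphere', 'cyan_cylinder'), ('cyan_cylinder', 'red_sphere')]

# Counterfactual: with the cylinder removed, the sphere reaches the
# red sphere directly.
print(counterfactual_events(scene, "cyan_cylinder", steps=12))
# [('metal_sphere', 'red_sphere')]
```

The key design point the talk makes is visible even in this toy: answering the counterfactual question requires no extra observation, only re-running the same dynamics model on an edited scene.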
Quantitative evaluation shows that our model improves performance on CLEVRER, especially on the more challenging explanatory, predictive, and counterfactual questions. This shows that dynamics modeling is an essential missing component for solving these tasks. In summary, we propose a new dataset for studying causal reasoning in videos and introduce a novel model for this task. Our experiments show that applying dynamics modeling to object-centric state representations of videos sheds light on causal reasoning. CLEVRER is now available online. Please visit our website to download the dataset and for further details. Thank you.