BERT explained in 60 seconds. Transformer neural networks map one sequence to another; for example, they can translate an English sentence into French. But a translation model isn't built for general language understanding, and training a new model from scratch demands a lot of labeled data for every task. To solve this, we can stack a bunch of transformer encoders, and we get Bidirectional Encoder Representations from Transformers, or BERT. We first pre-train BERT to understand language using two objectives: masked language modeling, where the model predicts randomly hidden words, and next sentence prediction, where it decides whether one sentence follows another. We then fine-tune the pre-trained model to solve other language problems. Because pre-training already teaches the model general patterns of language, fine-tuning needs relatively little task-specific data, and the architecture is expressive enough to perform well on general language tasks such as natural language inference and sentence similarity.
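To make the masked language modeling objective concrete, here is a toy Python sketch (not from the original explainer): hide a random fraction of tokens and keep the originals as targets for the model to recover. The 15% masking rate follows the BERT paper; the simplification that every selected token becomes `[MASK]` is ours (the real recipe also sometimes substitutes a random token or leaves the token unchanged).

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Toy sketch of BERT's masked language modeling objective:
    replace ~15% of tokens with [MASK] and record the originals
    as the labels the model must predict."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = no prediction needed at this position
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok        # target the model should recover
            masked[i] = mask_token
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
print(masked)  # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(labels)  # e.g. [None, None, 'sat', None, None, None]
```

Because the hidden word must be inferred from context on both sides, this objective is what makes the learned representations bidirectional.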
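And to make the fine-tuning step concrete, a minimal sketch using the Hugging Face transformers library (our choice of tooling; the explainer names none). It loads pre-trained BERT with a fresh two-label classification head, which you would then fine-tune on a small labeled dataset for your task.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained BERT; the classification head on top is newly
# initialized and still needs fine-tuning on labeled examples.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) -- one score per label
```

Only the small head and some light weight updates are learned during fine-tuning, which is why a few thousand labeled examples are often enough.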