There are lots of varieties of BERT, and other competing transformer models, and they're all basically very similar. We talked about BERT using self-attention and masking; in the worksheet we mostly used RoBERTa, which has the same architecture and the same size as BERT, but uses byte-pair encoding, so the tokenization is slightly different, and it's trained on more data, which is the key part. There's also DistilBERT, which, as you recall from distillation (that was a long time ago), is a compressed, much smaller BERT: it runs faster, has a smaller memory footprint, and is a little less accurate, but for lots of applications people prefer speed over accuracy. There are dozens of different pre-trained BERTs, ALBERT, lots of funny names, and if you go to Hugging Face you can get lots of them pre-trained and ready to use, so give them a try and see what the trade-off is between speed and accuracy.
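As a rough sketch of what trying these out looks like (not from the lecture itself, just an illustration), here's how you might load a few of these checkpoints with the Hugging Face transformers library and compare them on the same masked-word task. The checkpoint names are the standard ones on the Hub, and the timings are just whatever your machine gives you:

```python
import time
from transformers import pipeline

# Compare a few BERT-family checkpoints on the same fill-in-the-blank task.
# Each model has its own mask token (BERT/DistilBERT use [MASK], RoBERTa uses <mask>),
# so we read it off the tokenizer rather than hard-coding it.
models = ["bert-base-uncased", "distilbert-base-uncased", "roberta-base"]

for name in models:
    fill = pipeline("fill-mask", model=name)
    sentence = f"The capital of France is {fill.tokenizer.mask_token}."

    start = time.time()
    predictions = fill(sentence)
    elapsed = time.time() - start

    top = predictions[0]  # highest-scoring prediction for the masked token
    print(f"{name}: top token={top['token_str']!r} "
          f"score={top['score']:.3f} time={elapsed:.2f}s")
```

Running something like this is a quick way to see the speed/accuracy trade-off the lecture mentions: DistilBERT will typically finish noticeably faster than the full-size models, at the cost of slightly lower prediction confidence on some inputs.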