E184 – NLP, BERT vs. SMITH and What it Means for Marketers W/ Dave Davies
Description
Recently Dave Davies wrote about Google’s BERT vs SMITH and how they work & work together. With so much coming out in recent years around BERT, SMITH and just NLP in general we thought this was a great time to get Dave on to dig into these topics a bit more.
Guest Host: Dave Davies
To say he was happy to “geek” out over this topic would be an understatement!
Dave Davies covers a ton during the episode, and rather than trying to summarize all of his points, much of what you see in this write-up comes from notes Dave Rohrer took before and during the episode. To get what Dave Davies actually said, scroll down to the Transcript; it will be your best bet this time for sure.
Update on SMITH/Passages
Barely an hour after we finished recording this podcast, Danny Sullivan, as Google SearchLiaison, posted that Passages had gone live – the tweet is here.
What is NLP and Why Does an SEO/Marketer Care?
This was one of the questions Dave Rohrer wanted to dig into during the conversation, and we did touch on it. The short answer: structure your content well, as you should have all along. The long answer is that you need to listen to the episode or read the full transcript, because along the way Dave gives a number of ways and reasons to optimize for and think about NLP, BERT and SMITH.
What is BERT?
Search algorithm patent expert Bill Slawski (@bill_slawski of @GoFishDigital) described BERT like this:
“BERT is a natural language processing pre-training approach that can be used on a large body of text. It handles tasks such as entity recognition, part-of-speech tagging, and question answering, among other natural language processes. BERT helps Google understand natural language text from the Web.
Google has open sourced this technology, and others have created variations of BERT.”
– Bill Slawski
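As a rough illustration of the tasks Bill lists, here is a minimal sketch using the open-source Hugging Face transformers library and public BERT checkpoints. The model names and pipeline tasks are illustrative assumptions on our part, not anything referenced in the episode.

```python
# Minimal sketch: an open-source BERT checkpoint doing two of the
# NLP tasks Bill mentions. Requires: pip install transformers torch
from transformers import pipeline

# Masked-language modeling -- the pre-training objective BERT learns from.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Google uses BERT to better understand [MASK] queries."))

# Named entity recognition -- one of the downstream tasks BERT variants
# are fine-tuned for. The checkpoint here is a public community model.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Bill Slawski of Go Fish Digital writes about Google patents."))
```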
What is SMITH?
SMITH stands for Siamese Multi-depth Transformer-based Hierarchical encoder. In a very simplified description, the SMITH model is trained to understand passages within the context of the entire document. SMITH and BERT are “similar” at a very simple level, but where SMITH gets involved is in understanding long, complex documents and long, complex queries. BERT, as you will learn from Dave Davies, is much better suited to shorter pieces of content.
At its core, SMITH takes a document through the following process (paraphrased mostly from Dave Davies’ article; a small illustrative sketch follows the list):
- It breaks the document into grouping sizes it can handle, favoring sentences (i.e., if the document would allocate 4.5 sentences to a block based on length, it would truncate that to four).
- It then processes each sentence block individually.
- A transformer then learns the contextual representations of each block and turns them into a document representation.
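To make the block-splitting step concrete, here is a minimal Python sketch of that kind of whole-sentence chunking. It is our own illustration under an assumed block size and a crude whitespace tokenizer, not the actual SMITH implementation.

```python
# Illustrative sketch of SMITH-style sentence blocking (not Google's code).
# Assumption: a block holds at most MAX_BLOCK_TOKENS tokens, and we only
# add *whole* sentences -- a sentence that would overflow a block starts
# the next one, mirroring the "4.5 sentences truncated to four" behavior.
from typing import List

MAX_BLOCK_TOKENS = 32  # illustrative; the real model uses larger blocks

def sentence_blocks(sentences: List[str]) -> List[List[str]]:
    blocks, current, used = [], [], 0
    for sent in sentences:
        n_tokens = len(sent.split())  # crude whitespace "tokenizer" for the sketch
        if current and used + n_tokens > MAX_BLOCK_TOKENS:
            blocks.append(current)    # close the block on a whole-sentence boundary
            current, used = [], 0
        current.append(sent)
        used += n_tokens
    if current:
        blocks.append(current)
    return blocks

# Each block would then be encoded individually (step 2), and a second
# transformer would combine the block representations into one
# document representation (step 3).
```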
BERT vs. SMITH
- BERT taps out at 256 tokens per document. After that, the computing cost gets too high for it to be functional, and often just isn’t.
- SMITH, on the other hand, can handle 2,048 tokens, so the documents can be 8x larger. (A quick token-counting sketch follows this list.)
- SMITH is the bazooka. It will paint the understanding of how things are. It is more costly in resources because it’s doing a bigger job, but is far less costly than BERT at doing that job.
- BERT will help SMITH do that, and assist in understanding short queries and content chunks.
- That is, until both are replaced, at which time we’ll move another leap forward and I’m going to bet that the next algorithm will be:
- Bidirected Object-agnostic Regression-based transformer Gateways. (So Dave, you are saying Google is going to create a BORG algo? Nice!!!)
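For a feel of how those token budgets bite in practice, here is a small sketch that counts BERT tokens for a document using the public bert-base-uncased tokenizer. The file name is hypothetical, and the limits are simply the ones quoted in the list above.

```python
# Sketch: checking a document against the token budgets quoted above.
# Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

document = open("long_article.txt").read()  # hypothetical input file
n_tokens = len(tokenizer.encode(document, add_special_tokens=True))

print(f"{n_tokens} tokens")
print("Fits the BERT budget quoted above: ", n_tokens <= 256)
print("Fits the SMITH budget quoted above:", n_tokens <= 2048)
```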
Passage Ranking
The following passage about passages is from Google Passage Ranking Now Live in U.S.: 16 Key Facts:
Bartosz then asks whether tightening up the use of heading elements, to better communicate what the different sections of a page are about, will help, or whether Google will understand the content regardless of the markup.
“It’s pretty much that. With any kind of content, some semantic and some structure in your content so that it’s easier for automated systems to understand the structure and, kind of like, the bits and pieces of your content.
But even if you would not do that, we would still be able to say, like… this part of the page is relevant to this query where this other piece of your page is not as relevant to this query.”
Splitt next suggested that if you have a grasp on how to organize content, then this (passage ranking) is pretty much not applicable.
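Splitt’s point, that semantic structure makes it easier for automated systems to pick out the bits and pieces of a page, can be sketched with any off-the-shelf parser. The snippet below is our own illustration using BeautifulSoup, not a claim about how Google actually processes pages.

```python
# Illustration only: how *any* automated system can recover page sections
# from heading elements. This just shows the structure Splitt says makes
# passages easier to understand. Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = """
<article>
  <h2>What is BERT?</h2><p>BERT is a pre-training approach...</p>
  <h2>What is SMITH?</h2><p>SMITH handles long documents...</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all("h2"):
    # Each heading plus the content up to the next heading is a candidate
    # "passage" that can be matched against a query on its own.
    section_text = heading.find_next_sibling("p").get_text()
    print(heading.get_text(), "->", section_text)
```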
Passages vs. Subtopics
Splitt said Subtopics is a way of understanding things, while Passages is a ranking thing. For more on Subtopics, go read Google Confirmed Launching Subtopics Ranking In Mid-November 2020.
Additional and Mentioned Resources:
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
- Google BERT Update – What it Means
- No, Google SMITH Algorithm Is Not Live
- Google’s SMITH Algorithm Outperforms BERT
Full Transcript
Matt Siltala: [00:00:00] Welcome to another exciting episode of The Business of Digital Podcast, featuring your hosts, Matt and Dave Rohrer.
Hey guys, excited for everybody to join us on another one of these Business of Digital Podcast episodes. And today, well, today is going to confuse me, because we have a bunch of Daves in the house. First of all, I’d like to welcome Dave Davies, founder, or co-founder, of Beanstalk, and, uh, we’ve got to make that clear, don’t want to get anyone in trouble, but welcome, Dave and other Dave.
Hi, how are you guys doing?
Dave Davies: [00:00:39] I can speak for myself. We’re doing well. And as Daves from our generation, I can say, and Mr. Rohrer, correct me if I’m wrong, you’re, you’re used to chatting with a multitude of other Daves at the same time,
Dave Rohrer: [00:00:50 ] there are a number of us out in the industry. So yes, it does happen from time to time.
Matt Siltala: [00:00:54] There’s quite a few, um, Matts, but only, you know, very few with just one [00:01:00] T. And I actually use two Ts sometimes because of, like, reputation management and how people search. Anyway, it’s confusing, but welcome, Dave. I’m glad to have you on, and I’m going to kind of toss this over to the other Dave, Dave Rohrer.
And, uh, there was an article that you wrote, and I’m going to let him get into that and why he wanted to chat with you today. So, uh, let’s do it.
Dave Rohrer: [00:01:25] And I will let Mr. Davies give a quick intro, but also just a bit of background, since you’ve written about not just this, but some other similar topics in the past.
So I don’t know if you want to just kind of give, uh, your intro, and then kind of give an intro into SMITH and BERT, but really start at the very beginning of what the heck NLP is, so I don’t keep saying NPL.
Dave Davies: [00:01:48] Sure. Um, well, I guess Matt did a, did a good intro. Um, I’m Dave from, from Beanstalk Internet Marketing, and a co-founder with, with my wife.
So, so that part’s, [00:02:00] that part’s awesome. And I need to get that in, so now, now I won’t get in trouble.
Matt Siltala: [00:02:02 ] and I must say that I really miss you guys. I miss seeing you guys at the conferences. I just, I have to throw it out that, that out there, because you know, I was thinking about it. And it seems like it just the year that we’ve had, or last year, it just kind of, you know, blink and it’s gone with, but a bit, but it’s been so long, but I was thinking about this and the last time that I actually saw you guys was, um, two pumpkins ago.
It’s crazy to think about that.