AI in Web Search

Hi! There used to be a time when a group of friends at dinner could ask a question like “is a hot dog a sandwich?” and it would turn into a basic shouting match with lots of gesturing and hypothetical examples. But now, we have access to a LOT of human knowledge in the palm of our hands…

so our friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none of them have a connected bun like hot dogs (disappointed). Search engines are a huge part of modern life. They help us access information, find directions to places, shop, and participate in sandwich arguments.

But how does Google find answers to questions? How are Siri and Alexa so smart but also easily stumped? How did IBM’s Watson beat the best Jeopardy players in the world? Well, search engines are just AI systems that are getting better and better at helping us find what we’re looking for.

INTRO

When we talk about search engines, we typically think about the AI systems online, like Google, Bing, DuckDuckGo and Ask Jeeves. But the basic ideas behind non-AI search engines have existed for centuries. Essentially, search engines gather data, create organization systems to sort that data, and find results to a question.

For example, when you needed an answer to a question and couldn’t search online, you could go to the library! Libraries gather data in the form of books and newspapers that are stacked neatly on the shelves.

Librarians have organization systems to help you find what you’re looking for. Knowing that magazines are on shelves by the water fountain, while kids’ books are on the second floor, is a kind of organization. Plus, fiction books are sorted by the author’s last name, while nonfiction has the Dewey Decimal System, and so on. Once you (or the librarian) have the resources you need, you’ll be able to find results to your question!

Now, rather than looking through books, Web search engines look through all the data on the World Wide Web, aka “the Web”. And instead of asking a human librarian where to find information, we ask an AI like John-Green-bot instead.

Jabril: Oh John-Green-bot? [JGB dialup beeps] Alright John-Green-bot, you’re all set.

We’re going to need that later.

And just so we’re clear, we’re using “Web” throughout this video even though it might sound a little old-fashioned. That’s because the Internet and the Web are not the same thing. The Internet is a collection of computers that send messages to each other. Video services like Netflix that play on your TV, for example, use the Internet, not the Web. The Web, on the other hand, is part of the Internet and uses the Internet’s connections to send documents and other content in a format that can be displayed by a browser like Chrome or Safari.

As with most AI systems, the first step is to gather lots of data.

To gather data on the Web, we can use a computer program called a Web crawler, which systematically finds and downloads Web pages. This is a HUGE task and happens before the search engine AI can take any questions. It starts on some Web page that we pick, called a seed, and downloads that page and finds all its links. Then, the crawler downloads each of the linked Web pages and finds their links, and so on… until we’ve crawled the whole Web.

After we have collected all the data, the AI’s next step is to organize it by building an index, which is a kind of lookup system. The kind that’s used for organizing Web pages is called an inverted index, which is like the index in the back of a textbook. For each word, it lists all of the Web pages that contain that word.

Usually, the Web pages are represented by ID numbers so we don’t have a long, messy list of URLs. Let’s say page 0 is the seed, which happens to be a page about Genghis Khan. It has a lot of words on it like “the”, “mongol”, “Khan”, “Genghis”, “who”, and “is”. In this inverted index, page 1 is about Marco Polo, but it mentions the word “Genghis” along with words like “the”, “Marco”, “Polo”, “who”, “are”, and “is”. Page 2 is about the Mongols, page 3 is a different Web page about Marco Polo, and page 4 is about water polo.
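The two steps so far, crawling and indexing, can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: `TOY_WEB` is a made-up stand-in for the real Web, the words on pages 2 to 4 and all of the links are invented, and `crawl` and `build_inverted_index` are hypothetical helper names. A real crawler downloads pages over the network and never actually finishes.

```python
from collections import defaultdict, deque

# Hypothetical toy Web: page ID -> (words on the page, IDs of linked pages).
# Words for pages 0 and 1 follow the example above; everything else
# (the words on pages 2-4 and all of the links) is made up.
TOY_WEB = {
    0: (["the", "mongol", "khan", "genghis", "who", "is"], [1, 2]),
    1: (["genghis", "the", "marco", "polo", "who", "are", "is"], [0, 3]),
    2: (["the", "mongol", "empire"], [0]),
    3: (["marco", "polo", "travels"], [4]),
    4: (["water", "polo", "rules"], []),
}

def crawl(seed, web):
    """Start at the seed and follow links until no new pages turn up."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        page_id = queue.popleft()
        words, links = web[page_id]   # "download" the page
        pages[page_id] = words
        for link in links:            # queue every link we have not seen yet
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_inverted_index(pages):
    """For each word, list the IDs of all pages that contain that word."""
    index = defaultdict(set)
    for page_id, words in pages.items():
        for word in words:
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

pages = crawl(0, TOY_WEB)
index = build_inverted_index(pages)
print(sorted(pages))     # every page ID the crawler reached
print(index["genghis"])  # pages that mention "genghis"
print(index["polo"])     # pages that mention "polo"
```

Once the index exists, finding the pages for a word is a single dictionary lookup, which is why the slow crawling and indexing work happens before any questions come in.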

So, let’s say we type “Who is Genghis Khan?” into a search engine. Our AI can use this inverted index to find results, which, in this case, are links to Web pages. The AI will look at the words “who”, “is”, “Genghis”, and “Khan” and use the inverted index to find relevant pages.

Our AI might find that Web pages zero, one, two and five have at least one of the words from the question “who is Genghis Khan?” When Siri says “I found this for you,” the AI is just returning a list of Web pages that contain the same terms as the question.

Except… most search engines include one more step. There are millions of pages online that contain the same terms.

So it’s important for search engines to rank Web pages, so that the top result is more likely to be relevant than the tenth result or the hundredth.

Of course, Google and Bing don’t hire “supervisors” to grade each possible question and answer to help their AI systems learn from training data. That would take forever, and they wouldn’t be able to keep up with all the new content that gets created every day.

Really, regular users like us do this training for free all the time. Every time we use a search engine, our behavior tells the AI whether or not the results answered our question.

For example, if we type “who is Genghis Khan” into a search engine, and click on a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find Genghis Khan isn’t ANYWHERE in that movie. So we’ll bounce back to the search results, and try again until we find a page that answers our question. A bounce indicates a bad result.

But if we click on a Wikipedia article about Genghis Khan and stay for a while reading, that’s a click-through, which probably means that we found what we were looking for… so that indicates a good result.

Human behavior like bounces and click-throughs gives AI systems the training data they need to learn how to rank search results and better answer our questions. Data from the Web and data from how we use the Web help make better and better search engines.

Now, sometimes we ask our smart devices questions and we want actual answers… not links to Web pages.
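Before turning to direct answers, the click feedback described above can be made concrete with a small sketch. It swaps the learned ranking model for plain counting, which is a deliberate simplification and not how any real search engine is known to work. The `LOG` entries, the page names, and the `rank_pages` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical log of (query, page, outcome) events. "click_through" means
# the user stayed on the page; "bounce" means they came straight back.
LOG = [
    ("who is genghis khan", "wikipedia/Genghis_Khan", "click_through"),
    ("who is genghis khan", "wikipedia/Genghis_Khan", "click_through"),
    ("who is genghis khan", "wikipedia/Genghis_Khan", "bounce"),
    ("who is genghis khan", "startrek/Wrath_of_Khan", "bounce"),
    ("who is genghis khan", "startrek/Wrath_of_Khan", "bounce"),
]

def rank_pages(log, query):
    """Order pages for a query by their smoothed share of click-throughs."""
    good = defaultdict(int)
    total = defaultdict(int)
    for logged_query, page, outcome in log:
        if logged_query == query:
            total[page] += 1
            if outcome == "click_through":
                good[page] += 1
    # Add-one smoothing keeps pages with very few visits from scoring 0 or 1.
    scores = {page: (good[page] + 1) / (total[page] + 2) for page in total}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_pages(LOG, "who is genghis khan"))
```

With this made-up log, the Wikipedia page scores 3/5 and the Star Trek page scores 1/4, so the page people actually stayed on ends up first.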

When I say “OK Google, what’s the weather like in Indianapolis?” I don’t want to scroll through results.

For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases, which you might remember from our video about Symbolic AI. A knowledge base encodes information about the universe as relationships between objects, like “chocolate donut” and “John-Green-bot wears polo”.

One of the main problems with knowledge bases is that it’s really hard to write down all of the facts in the universe, especially commonsense things that humans take for granted but computers need to be told.

Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University.

In 2010, they created a huge knowledge base called the Never-Ending Language Learner, or NELL, which was able to extract hundreds of thousands of facts from random Web pages. The way it works is really clever, so let’s go to the Thought Bubble to see how.

NELL starts with some facts provided by a human, for example, the genre of music that Mozart plays is classical, which was represented like this: Mozart. musicGenre. Classical. Similarly, Jimi Hendrix. plays. Guitar. And Darth Vader. hasChild. Luke Skywalker.

Then, NELL gets to work and reads through each Web page one-by-one for words mentioned in those facts.

Maybe it finds the text “Mozart plays the piano.” NELL doesn’t know much about these symbols, but this text matches the same pattern as one of the facts provided by a human, specifically, the “plays” relationship. So NELL learns a new object: Piano. And a new fact: Mozart. plays. Piano.

By searching over the entire Web, NELL can learn lots of facts based on just the three original ones that humans gave it! Some facts might appear hundreds or thousands of times online, like Lenny Kravitz. hasChild. Zoë Kravitz. But NELL might also find facts that are mentioned SOMEWHERE online and extract them as potentially true.

Like, for example, Darth Vader. plays. Kloo Horn. We just don’t know! Just like how we look for multiple sources when writing a paper, NELL uses repetition and multiple sources to build confidence that the facts it’s finding are actually true.

To consider other relationships, NELL uses the highly confident facts it learned and searches through the Web again. Only this time, NELL is looking for new relationships. Maybe it finds the text “Darth Vader cuts off Luke Skywalker’s hand,” and NELL learns a new (very specific) relationship: cutsOffHand.

Over and over again, NELL will use known relationships to find new objects, and known objects to find new relationships, creating a huge knowledge base.

Thanks, Thought Bubble! AI systems can use huge knowledge bases, like this one extracted by NELL, to answer our questions directly.
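Before looking at how such a knowledge base gets queried, here is a rough sketch of the first half of the Thought Bubble: using a known relationship to find new objects, and using repetition to build confidence. It is a toy, not NELL's actual implementation. The `SENTENCES` list, the regular expression for “plays”, and the `min_mentions` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Seed facts in (subject, relation, object) form, as in the Thought Bubble.
SEED_FACTS = {
    ("Mozart", "musicGenre", "Classical"),
    ("Jimi Hendrix", "plays", "Guitar"),
    ("Darth Vader", "hasChild", "Luke Skywalker"),
}

# Hypothetical stand-in for text found on Web pages.
SENTENCES = [
    "Mozart plays the piano.",
    "Mozart plays the piano.",
    "Darth Vader plays the kloo horn.",
]

# Assumed text pattern for the "plays" relation: "<subject> plays the <object>."
PLAYS_PATTERN = re.compile(r"^(?P<subject>.+?) plays the (?P<object>.+?)\.$")

def extract_candidates(sentences):
    """Count how often each candidate 'plays' fact shows up in the text."""
    counts = Counter()
    for sentence in sentences:
        match = PLAYS_PATTERN.match(sentence)
        if match:
            fact = (match["subject"], "plays", match["object"].title())
            counts[fact] += 1
    return counts

def confident_facts(counts, min_mentions=2):
    """Keep only the facts repeated at least min_mentions times."""
    return {fact for fact, n in counts.items() if n >= min_mentions}

candidates = extract_candidates(SENTENCES)
knowledge_base = SEED_FACTS | confident_facts(candidates)
print(sorted(knowledge_base))
```

In this made-up text, Mozart. plays. Piano. shows up twice and joins the knowledge base, while Darth Vader. plays. Kloo Horn. shows up once and stays a candidate. The second half of the loop, finding new relationships from known objects, is left out of the sketch.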

Instead of using the words from our questions to search through an inverted index, an AI like Siri can reformulate our questions into incomplete facts and then look for matches in a knowledge base.

Hey John-Green-bot…
John-Green-bot: Yes, Jabril?
Jabril: “Who wrote The Bluest Eye?”

His AI could then reformulate that question into an incomplete fact, replacing “who” with a question mark. If John-Green-bot extracted that information earlier, he can find matches in his knowledge base and return the most confident result.

John-Green-bot: Toni Morrison wrote The Bluest Eye!
Jabril: Hey. Thanks, John-Green-bot!

Different words are categorized differently, so an AI like John-Green-bot can tell the difference between questions asking “who” and “when” and “where.”
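The reformulation step can be sketched as a lookup with one blank slot. This is a minimal illustration, not how Siri or any real assistant is implemented: the `KNOWLEDGE_BASE` contents, the confidence numbers, and the `answer` helper are hypothetical, and `None` plays the role of the question mark.

```python
# Hypothetical knowledge base: (subject, relation, object) -> confidence.
KNOWLEDGE_BASE = {
    ("Toni Morrison", "wrote", "The Bluest Eye"): 0.98,
    ("Toni Morrison", "wrote", "Beloved"): 0.97,
    ("John-Green-bot", "wrote", "The Bluest Eye"): 0.02,
}

def answer(incomplete_fact, knowledge_base):
    """Fill the None slot of an incomplete fact with the most confident match."""
    best_fact, best_confidence = None, 0.0
    for fact, confidence in knowledge_base.items():
        # A stored fact matches if it agrees on every slot that is not blank.
        matches = all(want is None or want == have
                      for want, have in zip(incomplete_fact, fact))
        if matches and confidence > best_confidence:
            best_fact, best_confidence = fact, confidence
    if best_fact is None:
        return None
    blank = incomplete_fact.index(None)
    return best_fact[blank]

# "Who wrote The Bluest Eye?" becomes (?, wrote, The Bluest Eye).
print(answer((None, "wrote", "The Bluest Eye"), KNOWLEDGE_BASE))
```

In this sketch the blank slot is chosen by hand. Picking it automatically from a question word such as “who”, “when”, or “where” is the part the sketch leaves out.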

But that gets more complicated, so we’re not going to dive into the details here. If you want to learn more, you can read about part-of-speech tagging systems.

Using all these strategies, search engines have become really good at answering common questions. But questions like “How many trees are in Ohio?” or “How many hot dogs are eaten in the South Sandwich Islands annually?” still stump most AI systems, because not enough people ask them and AI hasn’t learned how to answer them well yet.

It’s also important to watch out for search engine answers to questions like “Who invented the time machine?” because AI systems have a tough time with nuance and incomplete data.

Sorry Doc Brown.

And a big, sort of hidden, problem is that search engine AI systems are influenced by any biases in data online. For example, if I ask Google for images of “nurses,” it will mostly show pictures of female nurses.
