Stemming – We’re going to talk about stemming, an important part of pre-processing text. There are words that share a stem: for example, consult, consultant, consultants and consulted all share a single stem, consult. This means that all of them are related in meaning somehow and that, for example, in a web search for consultant, we also want the word consultants to be included in the results. This is what stemming can do for us.
Let’s just do a short review of what stemming is. There’s an area of linguistics called morphology, and it studies the parts of words. Words can be separated into their individual components, and we call each of these components a morpheme if it is a minimal unit of meaning. Morphemes can be of two types. They can be stems, which are the core meaning-bearing units in the word. So for example in unbearably, the stem is bear, because this is to support something or to carry something. And by the way, remember that in English roots and stems are roughly equivalent.
So a word has a stem and it also has affixes, which carry grammatical functions and are attached to the stem. So in unbearably, bear is the stem, un is a prefix that tells you that something is the negative of what we’re talking about, able is a suffix that tells you that something can be done, and ly makes the word an adverb.
So why do we need to reduce words to their stems sometimes? This is because words that have the same stem are related and we probably want them to be classified together. So if we have a search engine, for example, and we search for a word like tacos, we probably want the website to return documents that include the word tacos, like number three, "3 amazing tacos for you", but we also want the website to return documents that include the word taco, the stem of tacos: "Taco sale tomorrow", "Best taco recipes". These two words are related because they share a stem, and we want the computer to be able to identify this.
So sometimes we need to transform our text into its stems, that is, to perform stemming, for the computer to see these relationships. Again, without performing stemming, the computer could not see that taco and tacos refer to the same object.
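To make the search example concrete, here is a minimal sketch in Python. The `toy_stem` function is a hypothetical stand-in for a real stemmer (it only strips a final s), and the document titles are the ones from the slide plus one unrelated title added for contrast.

```python
def toy_stem(token: str) -> str:
    # Hypothetical stand-in for a real stemmer: strip one trailing "s",
    # but leave words ending in "ss" alone.
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

documents = [
    "3 amazing tacos for you",
    "Taco sale tomorrow",
    "Best taco recipes",
    "Pizza night",  # unrelated title, added for contrast
]

def search_exact(query: str, docs: list[str]) -> list[str]:
    # Match only documents that contain the query token exactly.
    q = query.lower()
    return [d for d in docs if q in d.lower().split()]

def search_stemmed(query: str, docs: list[str]) -> list[str]:
    # Stem both the query and the document tokens before comparing.
    q = toy_stem(query.lower())
    return [d for d in docs if q in {toy_stem(t) for t in d.lower().split()}]

print(search_exact("tacos", documents))    # only the first title
print(search_stemmed("tacos", documents))  # the three taco titles
```

With exact matching, a search for tacos misses the two titles that say taco; once both sides are stemmed, all three are found.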
There are several algorithms to implement stemming. A very common algorithm for English is called Porter’s algorithm. In the lower left of your screen you have the website where we’re going to demo the stemming, and also the actual implementation of the algorithm, in case you want to take a look at how it’s implemented. What this algorithm does is use suffix rules, which can be written as regular expressions, to decompose words and get to their stems.
So if it gets a word like caresses, for example, it will transform the sses into just ss: caresses to caress. If it gets something like ponies, it will replace the ies with an i: ponies to poni. Notice that the stem does not have to be a real word. If it gets something like cats, it will remove the s: cats to cat.
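This first step can be sketched with Python's `re` module. This is only an illustration of the plural rules, not a full stemmer; the names `STEP_1A_RULES` and `step_1a` are made up for this example. Note that in Porter's published rules ies is rewritten to i, so ponies comes out as poni.

```python
import re

# Plural rules from the first step of Porter's algorithm, as ordered
# (pattern, replacement) pairs. Only the first matching rule is applied.
STEP_1A_RULES = [
    (re.compile(r"sses$"), "ss"),  # caresses -> caress
    (re.compile(r"ies$"), "i"),    # ponies   -> poni
    (re.compile(r"ss$"), "ss"),    # caress   -> caress (left unchanged)
    (re.compile(r"s$"), ""),       # cats     -> cat
]

def step_1a(word: str) -> str:
    for pattern, replacement in STEP_1A_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word  # no rule matched

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

The ss rule looks like it does nothing, but it is there so that a word like caress stops at that rule and never reaches the rule that strips a final s.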
It then performs another step, again with suffix rules, where it tries to find verb endings. It does not look words up in a word list; instead it removes ing or ed only when what is left over contains a vowel. So it replaces the ing in walking with nothing: it takes walking and makes walk. It takes plastered and makes plaster, replacing the ed with nothing. A word like sing is left alone, because removing ing would leave just s.
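Here is a rough sketch of that ing/ed step, assuming the condition from Porter's published algorithm that the remaining stem must contain a vowel. The helper names `cv_form` and `step_1b` are made up for this example, and the sketch leaves out the eed rule and the clean-up that the full algorithm runs after removing a suffix.

```python
def cv_form(stem: str) -> str:
    # Map each letter to "V" (vowel) or "C" (consonant).
    # a, e, i, o, u are vowels; y is a vowel only right after a consonant.
    form = ""
    for ch in stem:
        is_vowel = ch in "aeiou" or (ch == "y" and form.endswith("C"))
        form += "V" if is_vowel else "C"
    return form

def step_1b(word: str) -> str:
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Remove the suffix only if the remaining stem contains a vowel.
            return stem if "V" in cv_form(stem) else word
    return word

for w in ["walking", "plastered", "sing", "bled"]:
    print(w, "->", step_1b(w))
# walking -> walk, plastered -> plaster, sing -> sing, bled -> bled
```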
In another step it takes common suffixes and simplifies them. So for example it replaces ational with ate: relational to relate. It replaces izer with ize, digitizer to digitize, and ator with ate, operator to operate. And it takes some other suffixes and removes them altogether. For example, in revival it removes al, so that the stem is reviv; adjustable becomes adjust; activate becomes activ.
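These two groups of rules can be sketched as small lookup tables. In Porter's published algorithm each rule also has a condition on the "measure" of the remaining stem, roughly the number of vowel-consonant sequences in it, so the sketch includes a simple version of that check. The names `measure`, `REWRITE`, `REMOVE` and `apply_rules` are made up for this example, and only the suffixes mentioned above are listed.

```python
import re

def measure(stem: str) -> int:
    # Count vowel-run + consonant-run sequences in the stem.
    # y counts as a vowel only right after a consonant.
    form = ""
    for ch in stem:
        is_vowel = ch in "aeiou" or (ch == "y" and form.endswith("C"))
        form += "V" if is_vowel else "C"
    return len(re.findall(r"V+C+", form))

# Suffixes rewritten to a shorter suffix (stem measure must be > 0).
REWRITE = [("ational", "ate"), ("izer", "ize"), ("ator", "ate")]
# Suffixes removed altogether (stem measure must be > 1).
REMOVE = [("able", ""), ("ate", ""), ("al", "")]

def apply_rules(word: str, rules: list[tuple[str, str]], min_measure: int) -> str:
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Apply the rule only if the stem is "long enough".
            return stem + replacement if measure(stem) > min_measure else word
    return word

for w in ["relational", "digitizer", "operator"]:
    print(w, "->", apply_rules(w, REWRITE, 0))
for w in ["revival", "adjustable", "activate", "rate"]:
    print(w, "->", apply_rules(w, REMOVE, 1))
```

The measure condition is what keeps a short word like rate from being cut down to r.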
So let me show you how this algorithm, the Porter, would work. And again, this is just one possible algorithm. This is one of the websites that we have on the lower left of the slides, and we’re gonna write a simple sentence about tacos, why not: "Tacos are simply amazing. Have you tried tacos?" So we’re gonna ask it to use the Porter stemmer and we’ll see what happens.
As you can see, tacos is stemmed into taco. Then we get are and simpli. So are is not replaced by to be; this is because of the specific implementation of the Porter. Simply is changed to simpli, which is the stem for things like simplify, simplification, simple and so forth. Amazing is replaced with amaz, and it’s probably cutting off a little bit too much, but amaz is the stem for amazing, and you have other words like amazement or amaze.
Then "have you tried": tried is simplified to tri, and tacos is simplified to taco. So if you do this with a document, you will be able to identify that both sentences contain the word taco, even if sometimes the word is taco and sometimes tacos. Again, the Porter is just one algorithm to implement stemming. Let’s try another English one, the Lancaster algorithm for example. As you can see here, we get slightly different decisions. So we get ar for are. We still have amaz and singular taco, and now we have hav as the stem for have with an e, and this is because the same stem also appears in having, for example. Tried and tacos are again simplified to tri and taco.
Another very common one is the Snowball stemmer, and you can see that it exists for other languages as well. This is related to the Porter; as a matter of fact it’s an evolution of the Porter, so the results are identical, I think: taco ar simpli amaz hav you tri taco. So again, we can use this to find further similarities across texts. If you want to look at the specific implementation of algorithms like Porter, I leave you with the URL on the presentation. Here is the part of the Porter algorithm where you can see that some of the morphemes are transformed to shorter forms.
So in summary, stemming transforms tokens into their stems, and this can be useful when you conduct searches, so that the computer can identify that tacos and taco are related. This process is language-dependent, so notice that I chose English alternatives, and you do need a specific stemmer for the language you’re working with. This process might be very complex: we looked at a very simple example in English, but imagine what a stemming algorithm would look like in Arabic.
It would need to figure out what the root consonants are and get rid of the vowels in the middle. So this process is heavily language-dependent, and if you’re working with a smaller language, maybe there will be no stemmer available and you will not be able to perform stemming. Thank you for your time, and I’ll see you in the chat forum and next week in more videos. Have a good evening, afternoon, morning.