Introduction to machine learning: Classifying news articles with the Weka workbench


In recent years machine learning has produced quite good results: it has streamlined certain businesses, enabled the prediction of events, and brought large savings in various areas. It is now possible to teach a computer to perform a fairly large number of tasks almost as effectively as a human. Although we are still quite far from strong machine intelligence comparable to a human's, many jobs can be delegated to machines without fear that they will do them worse than people.

Types of machine learning

Machine learning algorithms are generally divided into supervised and unsupervised ones.

Supervised algorithms require guiding examples, the so-called training set. The training set contains data together with the desired output. Based on these data, the algorithm builds a model that generates outputs and that reproduces, as well as possible, the mapping between the data and the expected outputs in the training set.

Unsupervised algorithms do not use a training set; instead, they are able to draw certain conclusions from a given data set and build a model from it. For example, if there are groups of points in a coordinate system, these algorithms will be able to recognize them. They are great for certain problems, but for others supervised learning is necessary.

Semi-supervised learning is a combination of supervised and unsupervised learning. Some authors treat these algorithms as a third group, but they are really just a combination of the first two. It is often used when the training set is small. In that case, a model is built on the small training set, and the remaining unlabelled data are then passed through the algorithm. If the algorithm is confident enough that it has predicted an output correctly, that data point is added to the training set, and the algorithm is trained again on the expanded training set. This is repeated until the training set is satisfactorily large and the model satisfactorily accurate.
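
To make this loop concrete, here is a minimal self-training sketch written against Weka's Java API (the library used later in this article). It is only an illustration: the base classifier, the two data sets and the 0.95 confidence threshold are placeholders, not part of any particular library recipe.

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

public class SelfTraining {

    // Train on the labelled set, move confidently predicted unlabelled examples
    // into it, and retrain, until no more examples can be added.
    public static Classifier selfTrain(Classifier base, Instances labeled,
                                       Instances unlabeled) throws Exception {
        boolean added = true;
        while (added && unlabeled.numInstances() > 0) {
            base.buildClassifier(labeled);
            added = false;
            for (int i = unlabeled.numInstances() - 1; i >= 0; i--) {
                Instance candidate = unlabeled.instance(i);
                double predicted = base.classifyInstance(candidate);
                double confidence = base.distributionForInstance(candidate)[(int) predicted];
                if (confidence >= 0.95) {                 // "confident enough" threshold (arbitrary)
                    candidate.setClassValue(predicted);   // trust the prediction
                    labeled.add(candidate);
                    unlabeled.delete(i);
                    added = true;
                }
            }
        }
        base.buildClassifier(labeled);   // final model on the expanded training set
        return base;
    }
}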

It is also interesting to note that machine learning algorithms are not new, although they have been improved lately. The most commonly used algorithms, the ones that actually bring the best results, have been around for over 50 years. Naive Bayes is based on Bayes' theorem, published in its current form in 1812, while the algorithm itself dates from the 1960s. Warren McCulloch and Walter Pitts created a computational model of neural networks in 1943. One of the most widely used classification algorithms, the support vector machine (SVM), appeared in 1963 in a paper by the Russian scientists Vladimir N. Vapnik and Alexey Ya. Chervonenkis. However, the algorithm did not reach the West until 1992, several years after the fall of the Berlin Wall.

Problems that machine learning solves

The main problem machine learning tries to solve is to learn, based on data, to produce the outcome a human would give. In supervised learning, the training set is called historical data, while for new data the outcome needs to be predicted. So we have a prediction problem. There are two types of prediction: classification and regression. Classification refers to discrete prediction, i.e. assigning data to a defined, finite number of classes. Regression, on the other hand, refers to continuous prediction: the algorithm learns and predicts how a function will behave in the future, based on historical data. Generally, other problems that machine learning can solve, such as clustering, can be reduced to classification or regression.

Example of text classification

To get a feel for working with machine learning, we will go through a relatively simple example: a classifier for news in Serbian. We will create a crawler with the help of the Scrapy framework, which will collect news articles from the B92 website. Our classifier will have three classes – info (politics and current events), sports and technology news. We will build a training set of 100 texts from each category and train an SVM model using the Weka workbench. Finally, we will try to improve the prediction accuracy a bit.

Data collection

In order to build the training set, we will write a simple crawler that collects articles. The crawler is able to pull down an article, follow all the links on that page, and in this way keep finding new articles until it has picked up everything. Of course, it can also be limited to specific domains and to addresses that contain specific text.

For the crawler I used the Scrapy framework, which makes it easy to create a crawler in the Python programming language. It can be installed with the pip package manager by running the command “pip install scrapy”. After that you can create a project by entering the command “scrapy startproject crawler” on the command line. Scrapy will create a crawler folder with a specific directory structure. Then, in the spiders directory created by the previous command, you can add modules that define the logic of our crawler. Our spider module looks as follows:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy import log


class B92InfoSpider(CrawlSpider):
    name = "b92info"
    allowed_domains = ["b92.net"]
    start_urls = [
        "http://www.b92.net/info/",
    ]

    rules = (
        # Follow links containing 'info/vesti/' and parse them with parse_item
        Rule(LinkExtractor(allow=('info/vesti/', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        log.msg("URL: " + response.url, level=log.DEBUG)
        # Build the file name from the last part of the URL, stripping query characters
        filename = ("files_b92_info/"
                    + (response.url.split("/")[-1] + "_crawled.html")
                    .replace("?", "").replace("=", ""))
        with open(filename, 'wb') as f:
            f.write(response.body)

The code defines a class derived from CrawlSpider (there are other spider classes, but this one is necessary for us because it allows following the links found on each page). The class must have a name, which is needed to run it. It also defines which domains and address ranges are allowed. After that we define the rules. Our rule says that the spider will follow links containing “info/vesti/”, that on every crawled page it will keep following further links until they are exhausted (follow=True), and that the parse_item function will be executed when a page is parsed. In parse_item we simply save the page content to the local hard disk in a specific directory. We made the crawlers for the sports and Technopolis sections of B92 in the same way; only the rules argument differs slightly. The crawler is run with the command “scrapy crawl b92info”.

Once the files with the news have been collected, they need to be converted from HTML into text files. Given that the article text on B92 sits inside a div tag with the class article-text, we can do it in the following way:

from bs4 import BeautifulSoup
from os import listdir
from os.path import isfile, join

mypath = "selected_info/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

for file in onlyfiles:
    file_object = open(mypath + file, "r")
    sdata = file_object.read()
    soup = BeautifulSoup(sdata)
    # The article body on B92 is inside a div with the class "article-text"
    mydivs = soup.findAll("div", {"class": "article-text"})
    with open("b92_info_txt/" + file.split("/")[-1] + ".txt", 'wb') as f:
        if mydivs is not None and len(mydivs) > 0:
            f.write(mydivs[0].text.encode("utf8"))

print "Done!"

The documents stored in this way can already be used for learning. However, I performed a few more transformations, such as removing leftover JavaScript code and converting the Serbian characters (č, ć, š, đ, ž) into combinations of ASCII characters (cx, cy, sx, dx, zx), to avoid problems with encoding.

Classification

The resulting files are enough to start the machine learning process. For machine learning we will use the Weka workbench. Weka is a workbench developed at the University of Waikato in New Zealand. It is written in the Java programming language and contains a large collection of machine learning algorithms, as well as the tools, filters and libraries needed in a machine learning workflow. Since we are using Weka, we first have to convert the data into a form Weka can understand. For learning, Weka uses .arff files, which contain information about the attributes as well as the data itself. Given that our data are currently spread over three directories with many files inside them, converting them to an .arff file by hand would be quite tedious, but Weka already contains a converter that can build the appropriate input from such a collection. On the command line you need to enter something like this:

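The command will most likely take the form below, which uses Weka's TextDirectoryLoader converter (the weka.jar classpath and the b92_unstemmed folder name are assumptions based on the description that follows):

java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir b92_unstemmed > train.arff

TextDirectoryLoader treats each subdirectory as one class and loads every file in it into a string attribute, together with its class label.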

This will take the b92_unstemmed folder, which contains the info, sports and techno folders, and create the train.arff file that will serve as our input.

After this, open the created file in the Weka Explorer. Unfortunately, not much can be done with the data in its current form, so it must be processed further. Currently, the input to the algorithm consists of the text of the article and its class. We want the algorithm to receive a vector of words as attributes, so we will turn the texts into word vectors with the help of the StringToWordVector filter. During conversion we also normalize all words to lowercase and enable the options TFTransform (term frequency) and IDFTransform (inverse document frequency). The last two options give us information about how important a term is in a document.
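
For reference, roughly the same preprocessing can be done programmatically through Weka's Java API. This is only a sketch, under the assumption that the file from the previous step is called train.arff and that the class attribute is the last one:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Vectorize {
    public static void main(String[] args) throws Exception {
        // Load the .arff file produced by TextDirectoryLoader
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);   // assume the class is the last attribute

        // Turn the raw article text into word vectors, as in the Explorer
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);   // normalize all words to lowercase
        filter.setTFTransform(true);       // term frequency
        filter.setIDFTransform(true);      // inverse document frequency
        filter.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, filter);

        System.out.println("Attributes after vectorization: " + vectorized.numAttributes());
    }
}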

Once the filter is applied, we can move to the Classify tab. As the classification algorithm we will choose SMO (Sequential Minimal Optimization). This is in fact a support vector machine (SVM) trained with a specific method for solving the underlying quadratic programming problem. We will select the polynomial kernel; in an SVM, the kernel is used to map the data into a space in which the classes can be separated, and each document is treated as a vector in a high-dimensional space whose dimensions are the words. When all this is ready, we can click the Start button. The output looks as follows:

[Screenshot: Weka output for the SMO model evaluated with 10-fold cross-validation]

For this evaluation, 10-fold cross-validation was used. This method divides the training set into 10 parts; the model is then trained on 9 of them while the remaining one is used for testing. The process is repeated 10 times, so that each of the parts is used for testing once. As far as the measures are concerned, we will concentrate on Precision, Recall, F-Measure and the confusion matrix.
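
The same evaluation can also be reproduced outside the Explorer; below is a minimal sketch using Weka's Java API, continuing from the vectorized data produced above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;

public class TrainAndEvaluate {

    // "vectorized" is the Instances object produced by the StringToWordVector step
    public static void evaluate(Instances vectorized) throws Exception {
        SMO smo = new SMO();
        smo.setKernel(new PolyKernel());   // polynomial kernel, as selected in the Explorer

        // 10-fold cross-validation: train on 9 folds, test on the remaining one, 10 times
        Evaluation eval = new Evaluation(vectorized);
        eval.crossValidateModel(smo, vectorized, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());   // precision, recall, F-measure per class
        System.out.println(eval.toMatrixString());          // confusion matrix
    }
}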

The confusion matrix shows how many documents from each class were classified into which class. We can see that all documents from the info class were classified correctly, while one piece of sports news was classified as news from the techno class, and so on.

Precision is the ratio between the documents correctly classified into a particular class and all documents classified into that class. (P = TP / (TP + FP), TP – true positives, FP – false positives)

Recall is the ratio between the correctly classified documents and the total number of documents in that class. (R = TP / (TP + FN), TP – true positives, FN – false negatives)

F-measure is a combination of Precision and Recall. This measure gives an overall picture of how accurate the algorithm is. It is calculated as F = 2 * P * R / (P + R).

As we can see from the results, the algorithm classifies documents quite well, with an F-measure of 91.4%. These results are comparable to human performance in document classification, with the difference that the machine will classify several hundred articles per second.

The trained model can be exported to a file using the options obtained by right-clicking on the entry in the result list. The exported model can then be imported into a program that will classify new documents.

In the Java programming language, importing the model looks as follows. First, you need to create a classifier:

InputMappedClassifier classifier = new InputMappedClassifier();

InputMappedClassifier maps the attributes of the incoming data onto the attributes stored in the model; in our case the model's attributes are the words. Other classifiers could have a problem if a new word appeared, because they would not know how to treat it. Next, the model needs to be loaded:

classifier.setModelPath(ClassifierPath);

After this, classification can be performed, with the proviso that all the transformations applied to the training texts must also be applied to the new texts, so the text first needs to be converted into a word vector. The classification itself is done with the classifier's classifyInstance method.
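
Putting the pieces together, classifying a new article could look roughly like the sketch below. It is only an illustration under several assumptions: the model file name, the class values (the training folder names) and the attribute layout are made up, and in practice the StringToWordVector filter should be initialized on the training data (batch filtering) so that its dictionary and TF/IDF statistics match the model.

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.misc.InputMappedClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifyNewArticle {
    public static void main(String[] args) throws Exception {
        // Load the model exported from the Weka Explorer (file name is an assumption)
        InputMappedClassifier classifier = new InputMappedClassifier();
        classifier.setModelPath("smo.model");

        // Build a one-row dataset with the same layout as the training data:
        // a string attribute holding the article text and a nominal class attribute
        ArrayList<Attribute> attributes = new ArrayList<Attribute>();
        attributes.add(new Attribute("text", (ArrayList<String>) null));
        attributes.add(new Attribute("@@class@@",
                new ArrayList<String>(Arrays.asList("info", "sport", "tehno"))));
        Instances data = new Instances("news", attributes, 1);
        data.setClassIndex(1);

        DenseInstance instance = new DenseInstance(2);
        instance.setDataset(data);
        instance.setValue(0, "tekst nove vesti ...");   // the new article text
        data.add(instance);

        // The same transformation as during training has to be applied to the new text
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setTFTransform(true);
        filter.setIDFTransform(true);
        filter.setInputFormat(data);
        Instances vectorized = Filter.useFilter(data, filter);

        // InputMappedClassifier maps the word attributes onto the model's attributes by name
        double prediction = classifier.classifyInstance(vectorized.instance(0));
        System.out.println("Predicted class: "
                + vectorized.classAttribute().value((int) prediction));
    }
}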

Improvement

One technique that can improve the accuracy of machine learning algorithms when working with natural language is stemming. Stemming is a heuristic process of removing the ends of words, in the hope that this will be done correctly in most cases; it sometimes also includes the removal of derivational affixes. The goal of stemming is to reduce all inflected forms of a word with the same meaning to a lowest common denominator. We will pass the documents through a stemmer for the Serbian language. Unfortunately there are not many resources for processing Serbian, so for the purposes of my master's thesis I developed a variation of a stemmer, whose Python implementation can be found on GitHub (https://github.com/nikolamilosevic86/SerbianStemmer). After stemming all the articles, we repeat the process from the previous section. The results were slightly better:

[Screenshot: Weka output for the SMO model after stemming]

The F-measure in this case improved by 0.8 percentage points, to 92.2%. Given that the accuracy of the algorithm was already very high, it is difficult to expect a big improvement, but the stemmer still brought some benefit.

Conclusion

As our example has shown, it is possible to get fairly good accuracy with the SVM algorithm even on a relatively small data set. In machine learning, a larger training set can often further improve prediction performance. Even so, an accuracy of 92.2% can be considered quite close to human performance, and if we also take into account the speed at which the machine processes the texts, this is certainly a large acceleration.

As previously mentioned, we are still far from general artificial intelligence, but machines can learn to do some simple tasks quite well and thus speed up our work. Research on general artificial intelligence is ongoing and certain advances are being made, so it will be interesting to see what the future brings.




Nikola Milošević

Teaching Assistant @University of Manchester

Nikola Milošević completed his bachelor's and master's studies at the Faculty of Electrical Engineering, University of Belgrade. He is currently a PhD student and teaching assistant at the Faculty of Computer Science, University of Manchester. His research deals with machine learning and language processing. He is the founder of the OWASP (Open Web Application Security Project) community in Serbia and is currently one of the leaders of the OWASP community in Manchester.

You can find out more about him on his blog http://www.inspiratron.org

