Moth Flame Optimization (Baeldung on Computer Science, 09 Jan 2022)

Explore the Moth Flame Optimization algorithm.

1. Introduction

“Like a moth to a flame” is an old expression comparing a strong attraction to something with the way moths are drawn to flames. The flying behavior of moths makes it seem as if they converge on a flame.

Since, in optimization problems, we look for algorithms that can discover and converge towards optimal solutions, perhaps moths can help us design one such algorithm. Like many solutions to optimization problems, Moth Flame Optimization (MFO) is inspired by the natural world. The algorithm we explore exploits the spiraling behavior of moths around flames.

In this tutorial, we’ll cover the MFO algorithm in detail.

2. The Optimization Problem

Moths often appear to cluster around bright lights at night. It very much seems like they are attracted to these lights. In reality, this is a consequence of their natural navigation strategy, known as transverse orientation. Moths rely on a distant light source in order to fly in a straight line, which they do by keeping the light source at a fixed angle to themselves. Without artificial light, this source is usually the moon. However, artificial light sources much closer to the moths distort this behavior: in order to keep a constant angle to a nearby light source, the moths end up spiraling around it.

We frame our optimization algorithm by considering a population of moths. As a population-based algorithm, we represent our moths using an N \times D matrix called M. N is the number of moths, and D is the dimensionality of the solution vector that a moth represents.

As we do not know where the solutions are, we need to approximate them. We do this using another matrix of moths, called the Flame matrix F. F is also an N \times D matrix, the same size as M, and it represents the best solutions found so far. We think of F as the flame matrix since these are the beacons that our search agents in M will orbit. Each of M and F has an associated vector, OM and OF respectively, holding the fitness values of the proposed solutions.

2.1. Algorithm Structure

We visualize the algorithm in the presented flowchart. The algorithm is relatively compact. We first initialize the moth population, then initialize or update the flames. After updating the flames, each moth moves toward its respective flame. This process continues until convergence, or until the algorithm reaches the step limit. What we have left to do is to define the update behavior for moths and flames.

2.2. Spiral Behavior

Moths need to be able to fly around and approach light sources. We achieve this by defining a logarithmic spiral. Other options for spirals are also possible. Let’s now define the formula for the logarithmic spiral:

(1)    \begin{equation*} S(M,F) = D \cdot e^{bt}\cdot \cos(2\pi t) + F \end{equation*}

(2)    \begin{equation*} D = \| M - F \| \end{equation*}

The spiral depends on the distance D between a moth M and its corresponding flame F. We also include b, a user-defined parameter controlling the shape of the spiral, and t, a random variable whose range decreases over time, shrinking the spiral.
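As a sketch, equations 1 and 2 translate directly into code. The function name and the element-wise treatment of the distance are our own illustrative choices:

```python
import math

def spiral_step(moth, flame, b=1.0, t=0.5):
    """Move a moth along a logarithmic spiral around its flame (equations 1 and 2)."""
    new_pos = []
    for m_i, f_i in zip(moth, flame):
        d_i = abs(m_i - f_i)  # element-wise distance D between moth and flame
        new_pos.append(d_i * math.exp(b * t) * math.cos(2 * math.pi * t) + f_i)
    return new_pos

# with t = 0.5, cos(2*pi*t) = -1, so the moth overshoots to the far side of the flame
print(spiral_step([0.0, 0.0], [1.0, 1.0], b=1.0, t=0.5))  # ≈ [-0.6487, -0.6487]
```

Smaller values of |t| keep the moth near its flame, while values near the ends of the range let it explore further away.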

2.3. Exploration vs. Exploitation

As in many agent-based optimization paradigms, exploration/exploitation must be considered. In our case, moths are forced to exploit their one corresponding flame. This prevents over-exploitation early on and promotes exploration. On the other hand, if we kept the number of flames constant over time, we may end up under-exploiting the best solutions discovered so far. So, to avoid this, the number of flames is also decreased over time according to equation 3:

(3)    \begin{equation*} No.Flames = N - l \times \frac{N - 1}{T} \end{equation*}

l is the current iteration, N is the maximum number of flames, and T is the maximum number of iterations.

We update the moths in relation to the best corresponding flames. Since we progressively decrease the number of flames, the remaining moths are updated with respect to the last flame.
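As a quick sketch of this schedule, assuming (as in the original MFO paper) that the result of equation 3 is rounded to an integer:

```python
N, T = 10, 100  # number of moths and maximum iterations (illustrative values)

for l in (1, 25, 50, 75, 100):
    # equation 3: the flame count decays linearly from N down to 1
    n_flames = round(N - l * (N - 1) / T)
    print(l, n_flames)
```

Early iterations keep almost all N flames (broad exploration); by the final iteration only one flame remains, so every moth exploits the single best solution.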

2.4. Features and Applications

We have seen how we can model moth behavior and use it to solve optimization problems. We can now reiterate some of the features of the algorithm and discuss why they work.

Local optima can be a problem for optimization algorithms. As a population-based algorithm, MFO finds and iterates on many potential solutions, which allows it to avoid many local optima. Further, assigning moths to flames and updating the sequence of flames also helps avoid local optima.

Solutions can be found by searching near our existing good solutions, represented by flames. The adaptive decrease in the number of flames encourages the exploitation of good solutions. The algorithm, therefore, converges over time while also being able to explore alternative areas of the search space.

3. MFO Pseudocode

Let’s see the pseudocode for MFO:

[Pseudocode: the MFO algorithm]

We can see the outline of the algorithm in the pseudocode above. First, we initialize a population of moths. We then initialize the flames and then perform an update step. After that, we combine the previous flames and newly updated moths to generate a new list of flames. Finally, we see that we repeat these update steps for a total of MaxIter time-steps.
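The outline above can be sketched in Python. This is a hedged reconstruction from the described steps, not the article's exact listing: the helper names, bounds handling, and the sampling convention for t (drawn from [a, 1] with a shrinking from -1 toward -2, as in the original MFO paper) are our assumptions.

```python
import numpy as np

def mfo(fitness, dim, n_moths=20, max_iter=100, lb=-10.0, ub=10.0, b=1.0, seed=0):
    """A minimal Moth Flame Optimization sketch (minimization)."""
    rng = np.random.default_rng(seed)
    M = rng.uniform(lb, ub, (n_moths, dim))        # moth positions
    OM = np.array([fitness(m) for m in M])         # moth fitness values
    F, OF = M.copy(), OM.copy()                    # flames: best solutions so far
    for l in range(1, max_iter + 1):
        # combine previous flames and current moths, keep the best n_moths as flames
        all_pos = np.vstack([F, M])
        all_fit = np.concatenate([OF, OM])
        best = np.argsort(all_fit)[:n_moths]
        F, OF = all_pos[best], all_fit[best]
        # adaptively decrease the number of flames (equation 3)
        n_flames = round(n_moths - l * (n_moths - 1) / max_iter)
        a = -1 - l / max_iter                      # t's lower bound shrinks over time
        for i in range(n_moths):
            j = min(i, n_flames - 1)               # extra moths use the last flame
            D = np.abs(M[i] - F[j])                # element-wise distance (equation 2)
            t = (a - 1) * rng.random(dim) + 1      # random t in [a, 1]
            M[i] = D * np.exp(b * t) * np.cos(2 * np.pi * t) + F[j]  # equation 1
        M = np.clip(M, lb, ub)
        OM = np.array([fitness(m) for m in M])
    return F[0], OF[0]
```

For example, `mfo(lambda x: float(np.sum(x ** 2)), dim=2)` minimizes the sphere function, returning the best position found and its fitness.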

4. Example

Suppose there is a campfire somewhere in the wilderness, and we want to locate it using the MFO algorithm. Further, suppose that moths benefit from the campfire’s heat, and that benefit is inversely proportional to their distance from the campfire. This gives us an easy-to-compute fitness function. Our wilderness is a 2-dimensional plane, and the campfire is located at coordinates \langle 5,5 \rangle on our grid. We set the maximum number of iterations to 100.

4.1. Generate the Moths and Flames

We begin by generating N moths. For this example we set N = 3. We see the moth positions in the table below:

[Table: initial moth positions and their fitness values]

4.2. Move the Moths

We have our moths, and we have scored them. We now need flames to optimize towards. These are the moths that perform best, obtained by sorting the initial set of moths by their fitness values. We show the flames in the table below:

[Table: the flame matrix, sorted by fitness]

We now want to calculate the movement of each moth using equation 1. This requires first calculating the distance between the searching moth and its flame; that is to say, between moth 1 and flame 1. The distance is simply the element-wise absolute difference between the two vectors. We show this in equation 4:

(4)   \begin{equation*} \langle 3,3\rangle - \langle 1,2\rangle = \langle 2,1\rangle \end{equation*}

We then multiply this distance by our logarithmic spiral term and finally add the position of our flame. For our purposes, we set b = 1, and our random exploration variable t samples as 0.3. We work this out in equation 5. The new position is closer to the flame. It is also closer to our target campfire:

(5)    \begin{equation*} S(M,F) = \langle 2,1\rangle \cdot e^{0.3}\cdot \cos(2\pi \cdot 0.3) + \langle 3,3\rangle = \langle 2.17,2.58\rangle \end{equation*}
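We can check this arithmetic directly by plugging the values from equations 4 and 5 into equation 1:

```python
import math

D = (2.0, 1.0)   # element-wise |M - F| from equation 4
F = (3.0, 3.0)   # the flame's position
b, t = 1.0, 0.3  # spiral shape parameter and sampled exploration variable

factor = math.exp(b * t) * math.cos(2 * math.pi * t)
S = tuple(d * factor + f for d, f in zip(D, F))
print([round(s, 2) for s in S])  # [2.17, 2.58]
```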

4.3. Update and Repeat

We repeat this calculation for each moth. After that, we update the number of flames, and then update our list of flames using our old flames and the new moth positions:

[Table: updated moth positions]

In this case, we do not decrease the number of flames this iteration: with N = 3 and T = 100, equation 3 gives 3 - 1 \times \frac{2}{100} \approx 3. So, when we combine our old flames and new moths, we are still looking for 3 flames. This means our new flame matrix contains 2 flames from the previous table and 1 flame from the new moths matrix:

[Table: the new flame matrix]

Finally, we repeat these calculations until convergence.

5. MFO Complexity

MFO relies on sorting the moths. The original paper suggests quicksort, which has a worst-case run time of O(N^2). Other sorting algorithms with better worst-case run times are possible. However, we will use quicksort. For more on choosing a sorting algorithm, see our article on choosing a sorting algorithm.

We compose the overall run-time in equation 6. We see from this that we have to run a sort at each iteration and then run the MFO update loop. Consequently, this gives us the big O complexity shown in 7:

(6)    \begin{equation*} O(MFO) = O(t(O(QuickSort)+O(MFO update))) \end{equation*}

(7)    \begin{equation*} O(MFO) = O(t(n^2 + n)) = O(tn^2 + tn) = O(tn^2) \end{equation*}

6. Conclusion

In this article, we have covered the Moth Flame Optimization algorithm. We further described the inspiration for the algorithm and then illustrated how to formalize that inspiration.

MFO is a numerical optimization algorithm with many potential uses, from hyper-parameter selection to optimizing multi-layer perceptrons. There are many similar nature-inspired population-based optimization algorithms. For further reading on this topic, we can see some of our other articles on Gray Wolf Optimization, Grasshopper Optimization, and Ant Colony Optimization.

To conclude, we summarize the benefits of MFO. As a population-based algorithm, there is some robustness to getting stuck in local minima. The approach maintains the best solutions discovered so far as flames and progressively exploits areas with good solutions. Adaptively updating the number of flames allows for balancing the exploration/exploitation trade-off. So, on the whole, MFO provides an algorithm that balances exploration against exploitation. It is also easy to understand and implement.

Word2vec Word Embedding Operations: Add, Concatenate or Average Word Vectors? (Baeldung on Computer Science, 07 Jan 2022)

An overview of the word2vec algorithm and the logic behind word embeddings.

1. Introduction

Although transformers now dominate the natural language processing field, word2vec remains a popular way of constructing word vectors.

In this tutorial, we’ll dive deep into the word2vec algorithm and explain the logic behind word embeddings. Through this explanation, we’ll be able to understand the best way of using these vectors and computing new ones using add, concatenate, or averaging operations.

2. Types of Word Embedding

A word embedding is a semantic representation of a word expressed with a vector. It’s also common to represent phrases or sentences in the same manner.

We often use word embeddings in natural language processing for vector space modelling. We can use these vectors to measure the similarities between different words as distances in the vector space, or feed them directly into a machine learning model.

Generally, there are many types of word embedding methods. Some of the most popular include word2vec, GloVe, and fastText.

In this article, we’ll work solely with the word2vec method.

3. Word2vec

Word2vec is a popular technique for modelling word similarity by creating word vectors. It’s a method that uses neural networks to model word-to-word relationships. Basically, the algorithm takes a large corpus of text as input and produces a vector representation for each word as output.

Word2vec starts with a simple observation: words which occur close together in a document are likely to share semantic similarities. For example, “king” and “queen” are likely to have similar meanings, be near each other in the document, and have related words such as “man” or “woman.” Word2vec takes this observation and applies it to a machine learning algorithm.

As a result, word2vec creates two types of vectors which represent each input word. These types are:

  • Word embedding or hidden representation (when the word is central)
  • Context word embedding or output representation (when the word is in the context)

We’ll describe both types of word vectors in more detail below.

3.1. Word Embedding

One way of creating a word2vec model is using the skip-gram neural network architecture. Briefly, this is a simple neural network with one hidden layer. It takes an input word as a one-hot vector and outputs a probability vector from the softmax function, with the same dimension as the input.

Let’s denote the input one-hot vector by x_{V \times 1}. The hidden layer, h_{1 \times N}, is then the product of the transposed one-hot vector, x_{V \times 1}, and the weight matrix, W_{V \times N}:

(1)   \begin{equation*} h = x^{T}W. \end{equation*}

Since the vector x is a one-hot vector, the hidden layer, or vector h, will always be equal to the i-th row of the matrix W if the i-th element in the vector x is equal to 1. Here is an example:

(2)   \begin{equation*} [0, 0, 1, 0, 0] \cdot \begin{bmatrix} 4 &8 &16 \\ 17 &0 &11 \\ 2 &13 &4 \\ 9 &5 &4 \\ 28 &31 &19 \\ \end{bmatrix} = [2, 13, 4]. \end{equation*}

Given the example, for a particular word w, if we input its one-hot vector into the neural network, we’ll get the word embedding as a vector h in the hidden layer. Also, we can conclude that the rows in the weight matrix W represent word embeddings.
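We can reproduce the lookup from equation 2 in a few lines of NumPy, using the same 5-word vocabulary and 3-dimensional weight matrix:

```python
import numpy as np

# the weight matrix W from equation 2: rows are word embeddings
W = np.array([[ 4,  8, 16],
              [17,  0, 11],
              [ 2, 13,  4],
              [ 9,  5,  4],
              [28, 31, 19]])

x = np.zeros(5)
x[2] = 1          # one-hot vector selecting the 3rd word

h = x @ W         # hidden layer: x^T W
print(h)          # [ 2. 13.  4.] -- exactly row 2 of W
```

The multiplication never really "computes" anything: it just selects the row of W corresponding to the hot index, which is why the rows of W are the word embeddings.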

3.2. Context Word Embedding

After we compute the hidden layer, we multiply it with the second weight matrix W'_{N \times V}. The output of this multiplication is the vector x', to which we apply the softmax activation function in order to get a probability distribution:

(3)   \begin{equation*} x'=softmax(hW'). \end{equation*}

Notice that the output vector has the same dimension as the input vector. Also, each element of that vector represents a probability that a particular word is in the same context as the input word.

From that, the context word embedding of the i-th word is the i-th column in the weight matrix W'. The whole neural network can be seen in the image below:
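The full forward pass can be sketched as follows. The weights here are random stand-ins (a real model learns them from a corpus), and `W2` plays the role of W':

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                        # vocabulary size, embedding dimension
W  = rng.standard_normal((V, N))   # input weights: rows are word embeddings
W2 = rng.standard_normal((N, V))   # output weights (W'): columns are context embeddings

def softmax(z):
    e = np.exp(z - z.max())        # shift by the max for numerical stability
    return e / e.sum()

x = np.zeros(V)
x[2] = 1                           # one-hot input word
h = x @ W                          # hidden layer = the word embedding (equation 1)
probs = softmax(h @ W2)            # P(context word | input word), equation 3
```

`probs` has one entry per vocabulary word, and the entries sum to 1, as a probability distribution must.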

4. Add, Concatenate, or Average Word Vectors

In order to understand what to do with both embedding vectors, we need to better understand weight matrices W and W'.

4.1. Relationship Between Two Embeddings

For example, let’s take the word “cat” and observe its embeddings cat_{we} and cat_{ce}. If we calculate the dot product between the two embeddings, we get the probability that the word “cat” is located in its own context.

With the assumption that the context of a particular word is a few words before and after that word, which is usually the case, the probability that the word is located in its own context should be very low. Basically, it means that it’s very rare to have the same word twice, close one after another, in the same text.

From that, we can assume:

(4)   \begin{equation*} cat_{we} \cdot cat_{ce} = P("cat"|"cat") \approx 0. \end{equation*}

Also, from the definition of the dot product, we have:

(5)   \begin{equation*} cat_{we} \cdot cat_{ce} = |cat_{we}||cat_{ce}| \cos{ \angle (cat_{we}, cat_{ce})}. \end{equation*}

One way for the dot product to be close to zero would be for the magnitude of the embeddings to be close to zero. But if we assumed that, the same assumption would apply to every word in the vocabulary, implying that the dot product between embeddings of all different words is close to zero, which is unlikely. Thus, we can instead assume that the cosine of the angle between the embedding vectors is close to 0:

(6)   \begin{equation*} \cos{ \angle (cat_{we}, cat_{ce})} \approx 0 \end{equation*}

This implies that the angle between the two embedding vectors tends to 90^{\circ}, i.e., they’re approximately orthogonal.

4.2. Add vs. Average

First of all, the visual representation of the sum of two vectors is the vector we get if we place the tail of one vector at the head of the other. The average of two vectors is the same vector, only multiplied by \frac{1}{2}. From that perspective, there’s not much difference between the sum and the average of two vectors, except that the average has half the magnitude. Because of the smaller vector components, we can favor the average over the sum:
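A small numerical check makes this concrete: the sum and the average point in exactly the same direction (cosine similarity 1), and the average has half the norm. The vectors here are arbitrary examples:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

s   = a + b
avg = s / 2  # the average is just the sum scaled by 1/2

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(s, avg))                               # 1.0: same direction
print(np.linalg.norm(s), np.linalg.norm(avg))        # the average is half as long
```

So for any similarity measure based on angles (like cosine similarity), the two are interchangeable.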


4.3. Combining Embedding Vectors

Let’s go back to the example with the word “cat” and its orthogonal vectors, cat_{we} and cat_{ce}. Let’s assume there’s a word “kitty” in the vocabulary. We’ll also assume that we have a perfect word2vec model which has learned that the words “cat” and “kitty” are synonyms, and that they appear in very similar contexts. This means there is a high cosine similarity between the word embeddings of these words, cat_{we} and kitty_{we}.

Now let’s assume the perfect scenario where the cosine similarity is 1, and the angle between the vectors is 0^{\circ}. We know that it’s common practice to use only word2vec word embeddings, while context embeddings are discarded. Furthermore, there’s no guarantee that the context vectors follow the same similarity as the word embeddings. If we add the context embedding to the word embedding, we might get the situation presented in the image below:

The angle between cat_{we} and kitty_{we} is 0^{\circ}, and these vectors are orthogonal to their corresponding context vectors. After addition, we see that the angle between sums is no longer 0^{\circ}.

Finally, it means that even with perfect conditions, the addition of the context vector to the embedding vector can disrupt the learned semantic representation in the embedding space. A similar conclusion can be made for a concatenation.
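We can demonstrate this with a toy example. The vectors below are illustrative, not learned: the two word embeddings are identical (cosine similarity 1), and each context embedding is orthogonal to both word embeddings and to the other context embedding:

```python
import numpy as np

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# identical word embeddings for two synonyms...
cat_we   = np.array([1.0, 0.0, 0.0, 0.0])
kitty_we = np.array([1.0, 0.0, 0.0, 0.0])
# ...each orthogonal to its own (and the other's) context embedding
cat_ce   = np.array([0.0, 1.0, 0.0, 0.0])
kitty_ce = np.array([0.0, 0.0, 1.0, 0.0])

before = cos_sim(cat_we, kitty_we)                      # 1.0
after  = cos_sim(cat_we + cat_ce, kitty_we + kitty_ce)  # drops to 0.5
print(before, after)
```

Even under these perfect conditions, adding the context vectors cuts the cosine similarity of the synonyms from 1 to 0.5.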

4.4. Experiments

Although it’s clear that the addition of context vectors will disrupt the learned semantic relationship of the embedding vectors, some experiments have been done using exactly this approach. It turned out that this operation can introduce additional knowledge into embedding space, and add a small boost in performance.

For instance, comparing human judgments of similarity and relatedness to cosine similarity between combinations of W and W' embeddings has shown that using only word embeddings, W predicts better similarity while using one vector from W and another from W' gives better relatedness. For example, for the word “house,” using the first approach, most similar words were “mansion,” “farmhouse,” and “cottage,” while using the second approach, most related words were “barn,” “residence,” “estate,” and “kitchen.”

Moreover, the authors of the GloVe method used the sum of word and context embeddings in their work and achieved a small boost in performance.

5. Conclusion

Firstly, we need to make clear that the goal of this article is to discuss operations between vectors for one particular word, and not for combining word vectors from one sentence. How to get a vector representation for a sentence is explained in this article.

In this article, we explained in detail the logic behind the word2vec method and its vectors, in order to discuss the best way of using word vectors. Specifically, in the original word2vec paper, the authors used only word embeddings as word vector representations and discarded the context vectors.

In contrast, some experiments were done using different combinations of these vectors. As a result, it was found that the sum between these vectors can introduce additional knowledge, but doesn’t always provide a better result. Also, from geometrical interpretation, we know that the summed and averaged vectors will behave almost the same.

Furthermore, there’s no evidence for using concatenation operations between vectors. Finally, we recommend using only word embeddings, without context vectors, since this is a well-known practice and context vectors won’t give significantly better results. However, for research purposes, the sum or average might be worth exploring.

Why Are Floating Point Numbers Inaccurate? (Baeldung on Computer Science, 07 Jan 2022)

Learn how floating-point numbers are represented in a computer and what limitations they have.

1. Introduction

Computer memory has a finite capacity. Real numbers, however, do not in general have a finite representation. For example, consider representing the rational number \frac{4}{3} as a decimal value: it is (1.333\overline{3})_{10} – it is impossible to do so exactly! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or truncated) when represented on a computer.

In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy, when doing practical numerical computing.

2. Rounding and Chopping

There are two distinguishable ways of rounding off a real number x to a given number t of decimals.

In chopping, we simply leave off all decimals to the right of the tth digit. For example, 56.555 truncated to t=1 decimal place yields 56.5.

In rounding to nearest, we choose a number with t decimals, which is the closest to x. For example, consider rounding off 56.555 to t=1 decimal place. There are two possible candidates: 56.5 and 56.6. On the real number line, 56.5 is at a distance of |56.555 - 56.5|=0.55 \times 10^{-1} from 56.555, whilst 56.6 is at a distance of |56.6 - 56.555|=0.45 \times 10^{-1}. So, 56.6 is the nearest and we round to 56.6.

Intuitively, we are comparing the fractional part of the number to the right of the tth digit, which is 0.055 = 0.55 \times 10^{-1} with 0.5 \times 10^{-1}. As 0.55 \times 10^{-1} > 0.5 \times 10^{-1}, we incremented the tth digit, which is 5 by 1.

What if, we wished to round off 56.549 to t=1 decimal places? Observe that, the fractional part is 0.49 \times 10^{-1}. As 0.49 \times 10^{-1} < 0.5 \times 10^{-1}, we leave the tth digit unchanged. So, the result is 56.5, which is indeed nearest to 56.549.

In general, let p be the fractional part of the number to the right of tth digit (after the decimal point). If p>0.5 \times 10^{-t}, we increment the tth digit. If p<0.5 \times 10^{-t}, we leave the tth digit unchanged.

In the case of a tie, when x is equidistant to two decimal t-digit numbers, we raise the tth decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.

Let’s see some more examples of rounding and chopping, to make sure, this sinks in. Assume that, we are interested in rounding off all quantities to t=3 decimal places:

[Table: rounding and chopping examples with t = 3 decimals]
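We can reproduce both rules with Python's decimal module: `ROUND_DOWN` corresponds to chopping, and `ROUND_HALF_EVEN` to rounding to nearest with ties going to the even digit. The helper names are ours:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def chop(x, t):
    """Drop all decimals to the right of the t-th digit."""
    return Decimal(x).quantize(Decimal(10) ** -t, rounding=ROUND_DOWN)

def round_nearest(x, t):
    """Round to the nearest t-decimal number, ties to the even digit."""
    return Decimal(x).quantize(Decimal(10) ** -t, rounding=ROUND_HALF_EVEN)

print(chop("56.555", 1))           # 56.5
print(round_nearest("56.555", 1))  # 56.6
print(round_nearest("56.549", 1))  # 56.5
print(round_nearest("56.55", 1))   # 56.6  (tie: 6 is the even digit)
```

Passing the numbers as strings matters: `Decimal("56.555")` is exact, whereas the float literal `56.555` is already rounded in binary before `Decimal` ever sees it.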

The difference between chopping and rounding has real-world implications! The Vancouver Stock Exchange index began trading in 1982, with a base value of 1000.00. Although the underlying stocks were performing decently, the index hit the low 500s at the end of 1983. A computer program re-calculated the value of the index thousands of times each day, and the program used chopping instead of rounding to nearest. A rounded calculation gave a value of 1082.00.

3. Computer Number Systems

In our daily life, we represent numbers using the decimal number system with base 10. In the decimal system, we’re aware, that if a digit d_{-k} stands k digits to the right of the decimal point, the value it contributes is d_{-k} \cdot 10^{-k}. For example the sequence of digits 4711.303 means:

    \begin{align*} 4 \cdot 10^3 + 7 \cdot  10^2 + 1 \cdot 10^1 + 1\cdot 10^0 + 3 \cdot 10^{-1} + 0 \cdot 10^{-2} + 3 \cdot 10^{-3} \end{align*}

In fact any integer \beta \geq 2 (or \beta \leq -2) can be used as a base. Analogously, every real number x \in R has a unique representation of the form:

    \begin{align*} x = d_n \beta^n + d_{n-1}\beta^{n-1} + \ldots + d_1 \beta^{1} + d_0 \beta^0 + d_{-1}\beta^{-1} + d_{-2} \beta^{-2} + \ldots \end{align*}

or compactly d_{n}\ldots d_1 d_0.d_{-1} d_{-2} \ldots, where the coefficients d_i, the digits in system \beta, are nonnegative integers such that 0 \leq d_i < \beta.

3.1. Conversion Algorithm Between Two Number Systems

Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert (250)_{10} to the binary format. We may write:

    \begin{align*} 250 = d_7 \cdot 2^7 + d_6\cdot 2^6 + d_5 \cdot 2^5 + d_4\cdot 2^4 + d_3\cdot 2^3 + d_2\cdot 2^2 + d_1\cdot 2^1 + d_0 \end{align*}

We can pull out a factor of 2 from each term, except the last, and equivalently write:

    \begin{align*} 250 = 2 \times (d_7 \cdot 2^6 + d_6\cdot 2^5 + d_5 \cdot 2^4 + d_4\cdot 2^3 + d_3\cdot 2^2 + d_2\cdot 2^1 + d_1) + d_0 \end{align*}

Intuitively, therefore, if we were to divide q_0=250 by 2, the expression in brackets, let’s call it q_1, is the quotient and d_0 is the remainder of the division. Similarly, since q_1 = 2 \times (d_7 \cdot 2^5 + d_6\cdot 2^4 + d_5 \cdot 2^3 + d_4\cdot 2^2 + d_3\cdot 2^1 + d_2) + d_1, division by 2 would return the expression in the brackets as the quotient; call it q_2, and d_1 as the remainder.

In general, if a is an integer with base \alpha and we want to determine its representation in a number system with base \beta, we perform successive divisions of a by \beta: set q_0 = a and

    \begin{align*} q_k = \beta \times q_{k+1} + d_k, \quad k = 0,1,2,\ldots \end{align*}

q_{k+1} is the quotient and d_k is the remainder in the division.

Let’s look at the result of applying the algorithm to (250)_{10}:

[Table: successive divisions of 250 by 2, with quotients and remainders]

Therefore, (250)_{10} = (d_7 d_6 d_5 d_4 d_3 d_2 d_1 d_0)_2 = 1111\,1010.
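The successive-division algorithm is a few lines of Python; the function name is ours:

```python
def to_base(n, beta):
    """Convert a non-negative integer to base beta by successive division."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, beta)   # quotient q_{k+1} and remainder d_k
        digits.append(d)
    return digits[::-1]          # remainders come out least-significant first

print(to_base(250, 2))    # [1, 1, 1, 1, 1, 0, 1, 0]  ->  1111 1010
print(to_base(250, 16))   # [15, 10]                  ->  FA
```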

3.2. Converting Fractions to Another Number System

If the real number a is not an integer, we write it as a = b + c, where b is the integer part and

    \begin{align*} c = c_{-1}\beta^{-1} + c_{-2}\beta^{-2} + c_{-3}\beta^{-3} + \ldots \end{align*}

is the fractional part, where c_{-1},c_{-2},c_{-3},\ldots are to be determined.

Observe that, multiplying both sides by the base \beta yields,

    \begin{align*} c \cdot \beta = \underbrace{c_{-1}}_{\text{Integer}} + \underbrace{c_{-2}\beta^{-1} + c_{-3}\beta^{-2} + \ldots}_{\text{Fractional part}} \end{align*}

an integer and a fractional portion. The integer portion is precisely c_{-1} – the first digit of c represented in base \beta. Consecutive digits are obtained as the integer parts, when successively multiplying c by \beta.

In general, if a fraction c must be converted to another number system with base \beta, we perform successive multiplications of c with \beta: set p_0 = c and

    \begin{align*} p_{k} \cdot \beta = c_{k-1} + p_{k-1}, \quad k = 0,-1,-2,\ldots \end{align*}

Let’s look at an example of converting (0.15625)_{10} to binary number system:

[Table: successive multiplications of 0.15625 by 2, with integer parts]

Therefore, (0.15625)_{10} = 0.0010\,1000 in the binary system.
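The successive-multiplication algorithm looks like this in Python (the function name is ours; 0.15625 happens to be exactly representable as a float, so there is no hidden rounding here):

```python
def frac_to_base(c, beta, t):
    """Convert a fraction 0 <= c < 1 to t digits in base beta by successive multiplication."""
    digits = []
    for _ in range(t):
        c *= beta
        d = int(c)       # the integer part is the next digit
        digits.append(d)
        c -= d           # keep only the fractional part for the next round
    return digits

print(frac_to_base(0.15625, 2, 8))   # [0, 0, 1, 0, 1, 0, 0, 0]  ->  0.0010 1000
```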

3.3. Fixed and Floating-Point Representation

Computers are equipped to handle pieces of information of a fixed size called a word. Typical word lengths are 32-bits or 64-bits.

Early computers made numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number of t binary digits in the fractional part. If the word-length of the computer is s+1 bits, (including the sign bit), then only numbers in the bounded interval I=[-2^{s-t},2^{s-t}] are permitted. This limitation is problematic, since, for example, even when x \in I and y \in I, it is possible that x - y is outside the bounds of the interval I.

In a floating-point number system, a real number a is represented as:

    \begin{align*} a = \pm m \cdot \beta^e, \quad m = (0.d_1 d_2 d_3 \ldots)_\beta, \quad e \text{ an integer } \end{align*}

The fractional part m of the number is called the mantissa or significand. e is called the exponent and \beta is the base. It’s clear that d_1 \neq 0, because if d_1 = 0, then we can always decrease the exponent by 1 and shift the decimal point one place to the right.

The mantissa m and the exponent e are limited by the fixed word-length in computers. m is rounded off to a number \overline{m} with t digits and the exponent e lies in a certain range.

Thus, we can only represent floating-point numbers of the form:

    \begin{align*} a = \pm \overline{m} \cdot \beta^e, \quad \overline{m} = (0.d_1 d_2 d_3 \ldots d_t)_\beta, \quad e_{min} \leq e \leq e_{max}, \quad e \in \mathbf{Z} \end{align*}

A floating-point number system F is completely characterized by the base \beta, precision t, the numbers e_{min} and e_{max}. Since d_1 \neq 0, the set F, contains, including the number 0,

    \begin{align*} 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) + 1 \end{align*}

numbers. Intuitively, there are 2 choices for the sign. The digit d_1 can be chosen from the set \{1,2,\ldots,\beta - 1\}. Each of the successive t-1 digits can be chosen from \{0,1,2,\ldots,\beta - 1\}. The exponent can be chosen from e_{max} - e_{min} + 1 numbers. By the multiplication rule, there are 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) distinguishable numbers. Including the number 0 gives us the expression above.

3.4. Why are Floating-Point Numbers Inaccurate?

Consider a toy floating-point number system F(\beta = 2, t = 3, e_{min}=-1, e_{max}=2). The set F contains exactly 2 \cdot 16 + 1 = 33 numbers. The positive numbers in the set F are shown below:

[Figure: the positive numbers of the toy system F on the number line]
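A quick enumeration confirms both the counting formula and the 16 positive numbers of the toy system:

```python
beta, t, e_min, e_max = 2, 3, -1, 2   # the toy system F

# the counting formula from section 3.3
count = 2 * (beta - 1) * beta ** (t - 1) * (e_max - e_min + 1) + 1
print(count)   # 33

# enumerate the positive numbers: normalized mantissas 0.d1d2d3 with d1 != 0
positives = sorted(
    (d1 * beta ** -1 + d2 * beta ** -2 + d3 * beta ** -3) * float(beta) ** e
    for d1 in range(1, beta)
    for d2 in range(beta)
    for d3 in range(beta)
    for e in range(e_min, e_max + 1)
)
print(len(positives))                    # 16 positive numbers
print(positives[0], positives[-1])       # 0.25 ... 3.5
```

Printing `positives` also makes the uneven spacing visible: the gap between neighbors doubles each time the values cross a power of 2.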

It is apparent that not all real numbers, for instance in, \mathbf{[1,2]} are present in \mathbf{F}. Moreover, not all floating-point numbers are equally spaced; the spacing jumps by a factor of \beta at each power of \beta.

The spacing of the floating-point numbers is characterized by the machine epsilon \mathbf{\epsilon_M}, which is the distance from 1.0 to the next largest floating-point number.
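In Python (3.9+), `math.ulp` reports this spacing directly for the double-precision system your hardware implements:

```python
import math

eps = math.ulp(1.0)       # distance from 1.0 to the next larger double
print(eps)                # 2.220446049250313e-16
print(eps == 2 ** -52)    # True: doubles carry 52 explicit mantissa bits
```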

If a real number x is in the range of the floating-point system, the obvious thing to do is to round off x to \overline{x}=fl(x), the floating-point number in F that is closest to x. It should already be clear that representing \mathbf{x} by \mathbf{fl(x)} introduces an error.

One interesting question is, how large is this error? It would be great if we could guarantee, that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

    \begin{align*} \frac{|fl(x) - x|}{|x|} \end{align*}

Recall, from the earlier discussion, that when rounding m to t decimals, we leave the tth decimal unchanged, if the part p, of the number to right of the tth decimal, is smaller than \frac{1}{2} \times 10^{-t}, else raise the tth decimal by 1. Here, we are working with the generic base \beta instead of the decimal base 10. Consequently, the round-off error in the mantissa is bounded by:

    \begin{align*} |\overline{m} - m| \leq \frac{1}{2}\cdot \beta^{-t} \end{align*}

The relative round-off error in x is thus bounded by:

    \begin{align*} \frac{|fl(x) - x|}{|x|} &= \frac{|\overline{m} - m| \cdot \beta^e}{m \cdot \beta^e}\\ &\leq \frac{\frac{1}{2}\cdot \beta^{-t} \cdot \beta^e}{(0.1)_\beta \beta^e} \quad \quad \{ m \geq (0.1)_\beta \} \\ &= \frac{1}{2} \cdot \beta^{-t + 1} \end{align*}

Modern floating-point standards such as the IEEE guarantee this upper bound. This upper bound is called the rounding unit, denoted by \mathbf{u}.

4. IEEE Floating-point Standard in a Nutshell

Actual computer hardware implementations of floating-point systems differ from the toy system we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There are two main basic formats – single and double precision, requiring 32-bit and 64-bit storage.

In single-precision, a floating-point number a is stored as the sign s (1 bit), the exponent e (8 bits), and the mantissa m (23 bits).

In double-precision, 11 bits are allocated to the exponent e, whereas 52 bits are allocated to the mantissa m. The exponent is stored with a bias, as b=e+1023 (in single precision, the bias is 127).

[Table: bit layout of IEEE 754 single- and double-precision formats]

The value v of the floating-point number a in the normal case is:

    \begin{align*} v = (-1)^s (1.m)_2 \cdot 2^e, \quad \quad e_{min} \leq e \leq e_{max} \end{align*}

Note that the digit before the binary point is always 1, similar to the scientific notation we studied in high school. In this way, 1 bit is gained for the mantissa.

4.1. Quirks in Floating-Point Arithmetic

Consider the following comparison:

(0.10 + 0.20) == 0.30

The result of this logical comparison is false. This behavior is surprising but expected: none of the decimal fractions 0.10, 0.20, and 0.30 has an exact binary representation. Let’s take a deeper look at what’s going on.
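We can reproduce this in any language that uses IEEE doubles; here is a minimal Python check:

```python
# Each decimal literal is rounded to the nearest machine number, so the sum
# of the roundings need not equal the rounding of the exact sum.
lhs = 0.10 + 0.20
print(lhs == 0.30)    # False
print(f"{lhs:.17g}")  # 0.30000000000000004
```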

Let’s put 0.10 in the double-precision format. Because 0.10 is positive, the sign bit s=0.

0.10 in base-2 scientific notation can be written as (-1)^0 (1 + m) 2^{e}. This means we must factor 0.10 into a number (1 + m) in the range 1 \leq 1 + m < 2 and a power of 2. If we divide 0.10 by different powers of 2, we get:

    \begin{align*} 0.10 / 2^{-1} &= 0.20\\ 0.10 / 2^{-2} &= 0.40\\ 0.10 / 2^{-3} &= 0.80\\ 0.10 / 2^{-4} &= 1.60 \end{align*}

Therefore, 0.10 = 1.60 \times 2^{-4}. The exponent is stored as e + 1023=-4 + 1023 = 1019, so the bit pattern in the exponent part is (011\, 1111\, 1011)_2.

The mantissa m is the fractional part 0.60 in binary form. Successive multiplication by 2 quickly yields m = (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001 \ldots)_2. However, the double-precision format allocates t=52 bits to the mantissa, so we must round off m to 52 digits. The fractional part p after the 52nd digit, (0.1001 \ldots)_2 \times 2^{-52}, exceeds \frac{1}{2} \beta^{-t} = (0.1)_2 \times 2^{-52}. So, we raise the 52nd digit by 1. Thus, the rounded mantissa is \overline{m}= (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010)_2.

Finally, we put the binary strings in the correct order. So, 0.10 in the IEEE double-precision format is:

    \begin{align*} \underbrace{0}_{s}\,\underbrace{011\, 1111\, 1011}_{e+1023}\,\underbrace{1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010}_{\overline{m}} \end{align*}

This machine number is approximately 0.10000000000000000555 in base 10. Similarly, the closest machine number to 0.2 in the floating-point system is approximately 0.2000000000000000111. Therefore, float(0.1) + float(0.2) > 0.30. On the right-hand side, the closest machine number to 0.3 is 0.29999999999999998889. So, float(0.30) < 0.30.
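We can verify both the bit pattern and the exact decimal value of this machine number with the standard struct and decimal modules:

```python
import struct
from decimal import Decimal

# Reinterpret the 64 bits of the double 0.1 as an unsigned integer.
bits = struct.unpack(">Q", struct.pack(">d", 0.1))[0]
print(f"{bits:064b}")  # sign (1 bit), biased exponent (11 bits), mantissa (52 bits)

# Decimal(float) shows the exact value of the nearest machine number.
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
```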

To summarize, algebraically equivalent statements are not necessarily numerically equivalent.

5. Conclusion

In this article, we’ve learned how the rules of chopping and rounding to the nearest work. We developed an algorithm for conversion between number systems. We also learned that any floating-point system is characterized by a quadruple F(\beta,t,e_{min},e_{max}). For example, the IEEE double-precision format is given as F(\beta=2,t=52,e_{min}=-1022,e_{max}=1023). Because computers have finite memory, the set of machine numbers \mathbf{M} is only a subset of the real numbers \mathbf{R}. The spacing between 1 and the next larger machine number is, in technical speak, called the machine epsilon.

Further, any real number x is always rounded off to the nearest machine number on a computer. The IEEE standards guarantee that the relative roundoff error is no more than a certain amount. Moreover, as programmers, we need to take proper care when doing floating-point operations.

The post Why are floating point numbers inaccurate? first appeared on Baeldung on Computer Science.]]> 0
Drift, Anomaly, and Novelty in Machine Learning Thu, 06 Jan 2022 17:11:43 +0000 Learn about novelty, concept and data drifts.

The post Drift, Anomaly, and Novelty in Machine Learning first appeared on Baeldung on Computer Science.]]>
1. Introduction

Anomalies and outliers in machine learning are inseparable concepts. Therefore, detecting and handling instances that differ from the norm plays an essential role in building a robust and efficient model.

On top of cleaning the dataset before the training phase, we apply anomaly detection techniques to find unusual instances. Some well-known use cases are fraud detection in finance, defect detection in manufacturing, intrusion detection in IT security, and condition monitoring in healthcare.

Anomaly detection is also closely related to novelty detection and noise removal. All of these areas focus on finding and handling different data instances.

Most of the time, momentary errors cause anomalies and novelties. However, we can also observe genuinely abnormal values. Furthermore, the underlying patterns within the data can change over time.

This tutorial will focus on some fundamental terms in detecting unusual observations and shifts in trends.

2. Drift

When training a model, we assume that the data and the environment are stationary. So, we build a model to predict the outcomes based on the examples observed under certain conditions. But, as the circumstances change, the behavior of the data evolves as well.

As time passes, we observe changes in the relationships between features or in the underlying distribution or structure of the data. We call this drift.

Model decay measures the speed of a model’s accuracy decrease over time. It’s affected by the drift rate.

There are two types of drift:

  • Concept drift
  • Data drift

Data drift happens when the input data is affected in an unforeseen way. For example, changing the data collection logic or introducing a new category causes data drift. To sum up, data drift happens when the input data distribution changes.

On the other hand, concept drift occurs when the output changes over time. As a result, the relationship between the dataset and the target variable changes. There are three types of concept drift:

  • Gradual drift
  • Sudden drift
  • Recurring drift

A gradual drift is when a set of minor variances accumulate over time. For example, the effects of climate change are minimal and undetectable for short periods.

Conversely, a sudden drift changes the predicted value at once. For instance, the effects of the 2008 global financial crisis are suddenly observable on various economic variables.

There are also recurring drifts, where the patterns repeat over time. For example, ice cream sales change with the seasons every year.

To neutralize the effects of drift, we need to recalibrate the model or change the business process.
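As a rough illustration (not a production technique), we can flag data drift by comparing summary statistics of a recent window of inputs against a reference window. The simple threshold rule below is a stand-in for proper statistical tests such as Kolmogorov–Smirnov; the windows and the factor k are made-up values:

```python
# A minimal data-drift sketch: flag drift when the recent window's mean is
# more than k reference standard deviations away from the reference mean.
from statistics import mean, stdev

def drifted(reference, recent, k=3.0):
    mu, sigma = mean(reference), stdev(reference)
    return abs(mean(recent) - mu) > k * sigma

reference = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1]
print(drifted(reference, [10.1, 9.9, 10.0, 10.2]))  # False: same regime
print(drifted(reference, [12.5, 12.8, 12.6, 12.7]))  # True: inputs shifted
```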

3. Anomaly

Anomalies are non-typical events that only happen under exceptional circumstances. As a result, they are often mistaken for outliers.

There are three classes of anomaly detection tasks:

  • Point anomaly
  • Contextual anomaly
  • Collective anomaly

Point anomalies occur when an individual observation is abnormal compared to the rest of the observations in the dataset. Such an occurrence is generally easy to spot. For example, observing snow in July in the northern hemisphere is a point anomaly.

Contextual anomalies occur when a particular observation doesn’t fit in a specific context. For instance, having 100 visitors within an hour would be an anomaly for website A, but it would be typical for website B.

Collective anomalies happen when a set of observations is non-typical concerning the rest of the data set. An example would be a website receiving only half of its usual traffic within a time range.

We can apply statistical methods such as z-scores or interquartile analysis to detect anomalies in a dataset. Alternatively, we can utilize learning algorithms like isolation forests or one-class SVMs. Or, we can turn to semi-supervised and unsupervised techniques, such as clustering and dimensionality reduction.
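As a sketch of the z-score approach on a hypothetical sensor trace (note the threshold: 3 is common for large samples, but in a tiny sample the outlier inflates the standard deviation and caps the attainable z-score, so we use 2 here):

```python
# Flag values whose z-score exceeds the threshold as point anomalies.
from statistics import mean, stdev

def zscore_anomalies(data, threshold=2.0):
    mu, sigma = mean(data), stdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

readings = [22.1, 21.9, 22.0, 22.3, 21.8, 22.2, 35.0, 22.1]
print(zscore_anomalies(readings))  # [35.0]
```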

We don’t need to recalibrate the model to eliminate the effects of an anomaly, as its existence doesn’t mean that our model is no longer valid.

4. Novelty

A novelty is a new observation that doesn’t resemble the training dataset.

In novelty detection, we start the training process with a preprocessed dataset from which all outliers have been eliminated. Then, we train the detection algorithm to decide whether a new observation fits the training data. In this context, we also treat outliers as novelties.

We can think of novelty detection as a semi-supervised outlier detection process. In the case of novelties, we don’t change the model or business process to address them.

The One-Class SVM is a well-known novelty detection algorithm. It learns the frontier of the observations in the initial dataset. Then, for each new observation, it decides whether it comes from the same distribution. As a result, data points falling outside the frontier are marked as novelties.

Novelty detection is widely adopted in online training tasks, as the new observations need to be classified as outliers or not in real-time.
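The frontier idea can be sketched with a toy distance-based detector (an illustrative stand-in, not the One-Class SVM): learn a radius from the clean training data, then flag new points that fall outside it. All the numbers below are made up:

```python
# Learn a radius from nearest-neighbor distances in the clean training set,
# then mark new observations farther than that radius as novelties.
def fit_radius(train, factor=2.0):
    """Scaled largest nearest-neighbor distance within the training set."""
    def nn_dist(i):
        return min(abs(train[i] - train[j]) for j in range(len(train)) if j != i)
    return factor * max(nn_dist(i) for i in range(len(train)))

def is_novelty(x, train, radius):
    return min(abs(x - q) for q in train) > radius

train = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8]
r = fit_radius(train)              # ~0.2 for this training set
print(is_novelty(1.05, train, r))  # False: fits the training data
print(is_novelty(4.00, train, r))  # True: far from everything seen
```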

5. Comparison

Outliers are closely related to anomalies and novelties, as well as drift.

The term drift refers to a change in the data regime. It can either result from a change in the environment or a difference in business logic. Consequently, drift in the data leads to a decrease in model accuracy. Hence, we’ll need to change the model. 

Drift is a concept, whereas anomalies and novelties are data instances. However, all of them are closely related to outliers.

In simple terms, we can think of anomalies as unusual or unexpected data instances within a dataset. The term is often used interchangeably with outliers.

Similarly, novelties are also anomalies in data, but they only exist in new instances. They don’t reside in the original dataset.

The presence of outliers, anomalies, or novelties doesn’t imply a change in the underlying data distribution or regime. Hence, detecting them won’t invalidate a model. So, there’s no need to change the model. 

6. Conclusion

In this article, we’ve learned about and compared some elementary concepts related to outliers in machine learning.

Firstly, we’ve defined drift and its types: concept and data drift. Then, we’ve given the meaning of anomaly and provided examples of its varieties: point, contextual, and collective anomaly. After that, we’ve described novelty.

Lastly, we’ve compared them to understand their differences better and concluded.

The post Drift, Anomaly, and Novelty in Machine Learning first appeared on Baeldung on Computer Science.]]> 0
Markov Decision Process: How Does Value Iteration Work? Tue, 04 Jan 2022 19:19:45 +0000 Learn how to implement a dynamic programming algorithm to find the optimal policy of an RL problem, namely the value iteration strategy.


The post Markov Decision Process: How Does Value Iteration Work? first appeared on Baeldung on Computer Science.]]>
1. Introduction

In some machine learning applications, we’re interested in defining a sequence of steps to solve our problem. Let’s consider the example of a robot trying to find the maze exit with several obstacles and walls.

The focus of our model should be finding a whole sequence of actions from the initial state that brings the robot to the desired state outside of the maze.

In Machine Learning, this type of problem is usually addressed by the Reinforcement Learning (RL) subfield that takes advantage of the Markov decision process to model part of its architecture. One of the challenges of RL is to find an optimal policy to solve our task.

In this tutorial, we’ll focus on the basics of Markov Models to finally explain why it makes sense to use an algorithm called Value Iteration to find this optimal solution.

2. Markov Models

To model the dependency that exists between our samples, we use Markov Models. In this case, the input of our model will be sequences of actions that are generated by a parametric random process.

In this article, our notation will consider that a system with N possible states S_{1},S_{2},...,S_{N} will be at a time t=1,2,... in a state q_{t}. So, the equality q_{3}=S_{2} indicates that at the time 3, the system will be at the state S_{2}.

We should highlight that the subscript t refers to time, but it could represent other dimensions, such as space.

But how can we put together the relationship between all possible states in our system? The answer is to apply the Markov Modelling strategy.

2.1. Theoretical Fundamentals

If we are at a state S_{i} at the time t, we can go to state S_{j} at the time t+1 with a given or calculated probability that depends on previous states:

(1)   \begin{equation*} P(q_{t+1}=S_{j}|q_{t}=S_{i}, q_{t-1}=S_{k},...) \end{equation*}

To make our model slightly simpler, we can consider that only the immediately previous state influences the next state instead of all preceding states. Only the current state is relevant to define the future state, and we have a first-order Markov model:

(2)   \begin{equation*} P(q_{t+1}=S_{j}|q_{t}=S_{i}) \end{equation*}

To simplify our model a little more, we remove the time dependency and define that from a state S_{i} we’ll go to a state S_{j} by means of a transition probability:

(3)   \begin{equation*} a_{i,j} = P(q_{t+1}=S_{j}|q_{t}=S_{i}) \end{equation*}

One of the easiest ways for us to understand these concepts is to visualize a Markov Model with two states connected by four transitions:

The variable \pi_{i} represents the probability of starting at state i. As we removed the time dependency from our model, at any moment, the transition probability of going from state 1 to 2 will be equal to a_{12} regardless of the value of t.

Moreover, the transition probabilities for a given state S_{i} should satisfy one condition:

(4)   \begin{equation*} \sum_{j=1}^{N}a_{ij}=1 \end{equation*}

For the \pi_{i} case, we have an equivalent condition, since the probabilities of all possible initial states should sum to 100\%:

(5)   \begin{equation*} \sum_{i=1}^{N}\pi_{i}=1 \end{equation*}

Lastly, if we know all states q_{t}, our output will be the sequence of the states from an initial to a final state, and we’ll call this sequence observation sequence O. The probability of a sequence can be calculated:

(6)   \begin{equation*} P(O=Q|\boldsymbol{A,\Pi}) = P(q_{1}) \prod_{t=2}^{T}P(q_{t}|q_{t-1})=\pi_{q_{1}}a_{q_{1}q_{2}}...a_{q_{T-1}q_{T}} \end{equation*}

In which \boldsymbol{\Pi} is the vector of initial probabilities and \boldsymbol{A} is the matrix of transition probabilities. \pi_{q_{1}} represents the probability of starting at state q_{1}, while a_{q_{1}q_{2}} represents the probability of reaching state q_{2} from q_{1}.

To illustrate all this theory, let’s consider \pi_{1}=0.7 and \pi_{2}=0.3 with defined values for the transition probabilities:

We’ll consider the observation sequence O = \{1, 2, 2\}. The probability of this sequence can be calculated:

(7)   \begin{equation*} P(O=Q|\boldsymbol{A,\Pi}) = P(q_{1})  \cdot P(q_{2}|q_{1}) \cdot P(q_{2}|q_{2}) \end{equation*}

After we replace the defined values:

(8)   \begin{equation*} = \pi_{1} \cdot a_{12} \cdot a_{22} = 0.7 \cdot 0.9 \cdot 0.25 = 0.1575 \end{equation*}
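The same computation can be sketched in Python. The matrix A below uses the example's a_{12} = 0.9 and a_{22} = 0.25; the remaining entries a_{11} and a_{21} are assumed values chosen only so that each row sums to 1:

```python
# P(O | A, pi) for an observable Markov chain, with states indexed from 0.
def sequence_probability(obs, pi, A):
    p = pi[obs[0]]                       # probability of the initial state
    for prev, cur in zip(obs, obs[1:]):  # chain the transition probabilities
        p *= A[prev][cur]
    return p

pi = [0.7, 0.3]
A = [[0.10, 0.90],   # a11, a12 (a11 assumed so the row sums to 1)
     [0.75, 0.25]]   # a21, a22 (a21 assumed so the row sums to 1)
print(sequence_probability([0, 1, 1], pi, A))  # ~0.1575
```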

2.2. Hidden Markov Models

In the previous example, all states were known, so we call them observable. When the states are not observable, we have a hidden Markov model (HMM).

To define the probabilities of a state, we first need to visit that state and then record the probability of the observation. A good HMM is one that correctly uses real-world data to build a model that simulates the source.

It is crucial that we remember that in this type of model, we have two sources of randomness: the observation in a state and the movement from one state to another.

We usually have two main problems in HMM:

  1. Considering a model \lambda, we’re interested in calculating the probability P(O|\lambda), which means the probability of any observation sequence O
  2. Finding the state sequence with the highest chance of generating the observation sequence O
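For problem 1, the standard solution is the forward algorithm; here is a compact sketch with a hypothetical two-state model (\pi, A, and the emission matrix B below are made-up numbers, not from the article):

```python
# Forward algorithm: P(O | lambda) summed over all hidden state paths.
def forward(obs, pi, A, B):
    """pi: initial probs, A: transition matrix, B[s][o]: emission probs."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][j] for s in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]   # B[state][observation]
print(forward([0, 1, 0], pi, A, B))  # ~0.10893
```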

3. Reinforcement Learning

As we stated in the introduction of this article, some problems in Machine Learning should have as a solution a sequence of actions instead of a numeric value or label.

In Reinforcement Learning, we have an agent responsible for decision-making, selecting an action in an environment.

3.1. Example

At any given state, the agent takes an action, and the environment returns a reward or penalty. A sequence of actions is called a policy, and the primary goal of an RL problem is to find the optimal policy that maximizes the total reward.

If we return to the initial example of a robot trying to find the exit of a maze, the robot would be the learner agent, and the environment would be the labyrinth:

From this point, we can make an analogy with the Markov model since the solution for this problem is a sequence of actions.

A Markov Decision Process is used to model the agent, considering that the agent itself generates a series of actions. In the real world, we can have observable, hidden, or partially observed states, depending on the application.

3.2. Mathematical Model

In our explanation, we’ll consider that the agent will be at a state s_{t} in a given discrete-time t. The variable S will represent all the possible states. The available actions will be represented as A(s_{t}), and after an action is taken and the agent goes from s_{t} to s_{t+1} a reward r_{t+1} is computed.

As we defined before, we have here a first-order Markov model since the next state and reward depend only on the current state and action.

From now on, we’ll consider that the policy \pi defines the mapping of states to actions \pi: S \to A. So, if we’re at a state s_{t}, the policy will indicate the action to be taken.

The value of a policy, referred to as V^{\pi}(s_{t}), is the expected sum of all the rewards obtained by following the policy starting from s_{t}. In our mathematical model:

(9)   \begin{equation*} V^{\pi}(s_{t})= E[r_{t+1}+r_{t+2}+ \ldots + r_{t+T}]= E\left[ \sum_{i=1}^{T} r_{t+i} \right] \end{equation*}

We should highlight that the above equation only works when we have a defined number of steps T to take. If we don’t have this value, we have an infinite-horizon model that penalizes future rewards by means of a discount rate \gamma:

(10)   \begin{equation*} V^{\pi}(s_{t})= E[r_{t+1}+\gamma r_{t+2}+ \gamma^{2}r_{t+3} + \ldots] = E\left[ \sum_{i=1}^{\infty} \gamma^{i-1}r_{t+i} \right] \end{equation*}

Our main goal is to find the optimal policy \pi^{*} that will lead our cumulative reward to its maximum value:

(11)   \begin{equation*} V^{*}(s_{t})= \max_{\pi} V^{\pi}(s_{t}), \quad \forall s_{t} \end{equation*}

We can also use a state-action pair Q(s_{t},a_{t}) to indicate the impact of taking action a_{t} at state s_{t}, and reserve V(s_{t}) to indicate the value of being at state s_{t}.

We start from the knowledge that the value of a state is equal to the value of the best possible action:

(12)   \begin{equation*} V^{*}(s_{t})=\max_{a_{t}} Q^{*}(s_{t},a_{t}) \end{equation*}

(13)   \begin{equation*} V^{*}(s_{t}) = \max_{a_{t}} \left( E[r_{t+1}]+\gamma \sum_{s_{t+1}}^{} P(s_{t+1}|s_{t},a_{t})V^{*}(s_{t+1}) \right) \end{equation*}

This represents the expected cumulative reward V^{*}(s_{t+1}) considering that we move with probability P(s_{t+1}|s_{t},a_{t}).

If we use the state-action pair notation, we have:

(14)   \begin{equation*} Q^{*}(s_{t},a_{t})=E[r_{t+1}]+ \gamma \sum_{s_{t+1}} P(s_{t+1}|s_{t},a_{t}) \max_{a_{t+1}} Q^{*}(s_{t+1},a_{t+1}) \end{equation*}

4. Value Iteration

Finally, to find the optimal policy for a given scenario, we can use the previously defined value function and an algorithm called value iteration, which is guaranteed to converge to the optimal values.

The algorithm is iterative, and it will continue to execute until the maximum difference between two consecutive iterations l, and l+1 is less than a threshold \delta:

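The iteration described above can be sketched in Python for a hypothetical two-state MDP (the states, transition model P, and rewards R below are illustrative, not the maze example that follows):

```python
# Value iteration: repeatedly apply the Bellman optimality backup until the
# largest change between consecutive iterations falls below delta.
def value_iteration(states, actions, P, R, gamma=0.9, delta=1e-6):
    """P[s][a] is a list of (prob, next_state); R[s][a] is the reward."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                   for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < delta:
            return V_new
        V = V_new

# Hypothetical two-state problem: "stay" is safe, "go" is risky but rewarding.
states = ["A", "B"]
actions = ["stay", "go"]
P = {"A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
     "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]}}
R = {"A": {"stay": 0.0, "go": 1.0}, "B": {"stay": 2.0, "go": 0.0}}
V = value_iteration(states, actions, P, R)
print(V["B"] > V["A"])  # True: state B is more valuable
```

For this toy model, staying at B is optimal, so V(B) converges to the fixed point 2/(1 - 0.9) = 20.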

Now we’ll see a simple example of the value iteration algorithm simulating an environment.

Our full map is a 6 \times 6 grid, but for the sake of simplicity, we’ll consider a small part of the maze: a 3 \times 3 grid with a reward of 50 in the center and all cell values initialized to zero.

Our robot can only take four actions: move up, down, left or right, and if it hits the wall to the right, it receives a penalty of 20:

As we’re simulating a stochastic model, there are probabilities associated with the actions. In this case, there is a 0.7 chance of the robot actually going in the defined direction and a 0.1 chance for the other three options of movement.

So, if the agent decides to go up, there is a 70% chance of going up and a 10% chance each of going down, right, or left. The robot only gets the reward when the action is completed, not during the transition.

In the first iteration, each cell receives its immediate reward. As the optimal action for cell c is going down, with a 0.1 probability of hitting the wall, we can calculate the immediate reward as (0.1 \times -20)=-2, and the map becomes:

For the next steps, we’ll use a discount rate \gamma = 0.9. For the second iteration, we’ll first consider the cell f and calculate the value of taking the optimal action, which is going left:

The state value will be the sum of the reward and the next value multiplied by the probability for all actions:

(15)   \begin{equation*} 0.7 (0+ (0.9 \times 50)) + 0.1(-20+(0.9 \times -2)) + 0.1(0+(0.9 \times -2)) + 0.1(0+(0.9 \times -2)) =31.5 -2.18 -0.18 -0.18 = 28.96 \end{equation*}

Similarly, for cell d, we observe that the optimal action is to go right, and all other directions give an expected reward of zero, so the state value is simply:

(16)   \begin{equation*} 0.7 (0 + 0.9 \times 50) = 31.5 \end{equation*}

After we calculate for all cells, the map becomes:

We only conducted two iterations, but the algorithm continues until the desired number of iterations or the threshold is reached.

5. Conclusion

In this article, we discussed how we could implement a dynamic programming algorithm to find the optimal policy of an RL problem, namely the value iteration strategy.

This is an extremely relevant topic to be addressed since the task of maximizing the reward in this type of model is the core of real-world applications. Since this algorithm is guaranteed to find the optimal values and to converge, it should be considered when implementing our solution.

The post Markov Decision Process: How Does Value Iteration Work? first appeared on Baeldung on Computer Science.]]> 0
Exporting LaTeX TikZ as Image Files Tue, 04 Jan 2022 19:18:30 +0000 Learn how to export LaTeX/TikZ defined images to image file formats.

The post Exporting LaTeX TikZ as Image Files first appeared on Baeldung on Computer Science.]]>
1. Introduction

Over the years, LaTeX has become a relevant tool for writing texts. With LaTeX, writers can concentrate on the content instead of the formatting of their manuscripts.

TikZ, in turn, is a LaTeX package to create graphic elements with LaTeX. In short, TikZ is a powerful tool that enables users to draw almost anything from scratch using vectors.

In this way, users can exploit TikZ’s capabilities to create images for multiple purposes besides only including them in their LaTeX write-ups. To make broader use of these images, we can convert TikZ-defined images to other popular image file formats.

In this tutorial, we’ll learn a technique to convert TikZ-defined images to other image file formats. First, we’ll see how to code and compile a LaTeX/TikZ image. Then, we’ll see a tool and the process to convert the LaTeX/TikZ output to an image file format.

2. Preparing the LaTeX/TikZ Image

Let’s consider an example .tex document defining a TikZ image. The following algorithm shows the LaTeX/TikZ code of our example document:

(1) \documentclass{standalone}
(2) \usepackage{tikz}
(3) \begin{document}
(4) \begin{tikzpicture}
(5) \filldraw[red] (0,0) circle (100pt);
(6) \end{tikzpicture}
(7) \end{document}

To create a standalone image with LaTeX/TikZ, we should first define the document class. A suitable option is a document class that fits the LaTeX output (the PDF file) to our image, such as the standalone class. This definition is located at line 1 of the algorithm.

Next, we should insert the line importing the TikZ package as in line 2 of the algorithm.

After defining these lines, we can open our document environment (line 3). Thus, we can open our TikZ image environment (line 4).

So, the following lines contain the definition of the intended image. The image definition uses the TikZ graphic commands. In our example, we only draw a red circle (line 5).

Finally, we close the TikZ image (line 6) and the LaTeX document (line 7) environments.

After defining the LaTeX/TikZ image, we compile it and generate the standard LaTeX output: a PDF file. This PDF file, in turn, contains our image.

The following image summarizes the process:

Next, we’ll learn how to convert this file into an image format in the following subsection.

3. Converting the LaTeX/TikZ PDF to Image

There exist multiple ways to convert the LaTeX standard output (PDF) to an image file. In this tutorial, however, we’ll use the ImageMagick tool.

We choose ImageMagick because it is a free and simple program available for several operating systems, such as Windows, Linux, and macOS.

First, we must download and install the program (if requested, select the “Install legacy utilities” checkbox). Furthermore, make sure that ImageMagick’s scripts are on the PATH of your operating system.

Next, we must install the GhostScript PDF interpreter. The ImageMagick tool employs GhostScript to process PDF files.

With the previous steps done, we can convert the PDF containing our image to an image file. Among the formats supported by ImageMagick are PNG (.png), JPG (.jpg), and SVG (.svg). In our examples, let’s call the chosen format EXT (.ext).

The most straightforward conversion is executed through the following command:

convert src/path/image.pdf dst/path/image.ext

In this way, the following image shows the general process of converting the LaTeX output (PDF) into an image file format:

Finally, the ImageMagick tool provides several options to modify the image while converting it. We can use these options by adding flags to the command line, such as -quality, -density, and -resize.

4. Conclusion

In this tutorial, we learned how to export LaTeX/TikZ defined images to image file formats. At first, we saw how to define a standalone image using LaTeX and TikZ. Then, we investigated a technique to convert the LaTeX output (PDF) to image file formats with the ImageMagick tool.

We can conclude that LaTeX/TikZ is a versatile tool for producing vectorial graphic elements. Furthermore, with the proper converting tools, these graphic elements are employable for several purposes as images.

The post Exporting LaTeX TikZ as Image Files first appeared on Baeldung on Computer Science.]]> 0
Fundamentals of Sandboxing Mon, 03 Jan 2022 10:06:00 +0000 Explore the different concepts and architectures used to sandbox applications.

The post Fundamentals of Sandboxing first appeared on Baeldung on Computer Science.]]>
1. Introduction

In computer security and software development, a sandbox refers to an environment in which a program can run isolated from the rest of your system with limited access to resources. This is useful for testing out parts of a process that may be susceptible to triggering system failures or misbehaving.

When anything happens in a sandbox, it is not supposed to leak onto the rest of the machine. Typically, we can restrict the memory space that the process allocates, the network access that it has (if any), and the directories to which it has access.

In this article, we explore the different concepts and architectures used to sandbox applications.

2. Concepts

Each one of the following architectures will have benefits and flaws. These vary according to their deployment speed, their level of abstraction from hardware components, and the complexity of the system upon which the application needs to run. In general, approaches that segment hardware components at a low level will have better isolation.

In contrast, approaches that abstract the architecture through multiple software levels are more flexible and scalable:

2.1. Virtual Machines & Hypervisors

Firstly, virtual machines (VMs) can be seen as “mini computers” inside the same machine hardware. These different combinations of independent operating systems with separate files, libraries, and applications can all run inside the same computer thanks to hypervisors. Hypervisors are pieces of software or firmware that can distribute the existing hardware resources to each of our different virtual machines.

This process is called “virtualization” and it is similar to setting virtual barriers between different environments on the same computer. Virtual machines take a few minutes to start and are usually slower than containers. However, they provide better isolation between applications.

2.2. Containers & Container Engines

Moving on from VMs, containers are another way of separating apps. Apps that run on the same OS can be containerized, deployed, and scaled automatically. Containers have grown in popularity recently with the advancement of tools like Docker and computer clustering tools like Kubernetes.

These different applications all run on the same hardware but have different limitations. Container engines are employed to manage network connections, memory, and other resources available to containers. These resources can even be scaled according to the needs of each container.

Because deploying this approach is faster than using virtual machines, containers provide a way to quickly assemble and take apart networks of applications communicating with each other, as each one runs in its own little pre-fabricated environment. Containers take seconds to deploy, whereas VMs take minutes.

2.3. Operating System Emulation

For this approach, the sandbox will replicate an operating system, but will not emulate any physical hardware. The complete emulation of one operating system can be a heavy task for a computer.

Therefore, a lot of operating system emulation software will only emulate some of the most common system calls.

Advanced types of malware can use infrequent system calls to circumvent this type of sandbox by appearing inactive inside the sandbox.

2.4. Full System Emulation

In full system emulation, all of the resources on your machine will be simulated, in order to provide a look inside and see what the sandboxed process is doing.

Everything from the CPU to the display on the screen is simulated by software and then handed to the sandboxed program. This is similar to containers, except that emulation is not restrained by the operating system architecture of the underlying machine.

Emulation engines can simulate a wide variety of system architectures and are not restrained to personal computers or servers. They can replicate cellphone and video game console operating systems, for example. Because the entire system is emulated, it is possible to analyze the information being sent to the different parts of the computer.

This is used to deeply understand the behavior of the application running inside it.

3. Evasion Techniques

Clever pieces of malware can detect the environment in which they are running. Using this ability, malware can appear safe when running inside a simulated environment and suddenly become active when it detects an actual host.

The malware can then employ different techniques to mask its own tracks. One of the most popular is process injection, which runs a payload in the same address space as an existing process. This often makes it harder to detect the specific process tied to the malware.

Although full system emulation can fix this problem, the quality of the emulation needs to be sufficient to appear as an actual system to the malware.

4. Conclusion

In this article, we discussed the different fundamental concepts in sandboxing.

The post Fundamentals of Sandboxing first appeared on Baeldung on Computer Science.
Non-opinionated vs. Opinionated Design Mon, 03 Jan 2022 10:00:10 +0000 Explore the concepts of non-opinionated and opinionated software design.

The post Non-opinionated vs. Opinionated Design first appeared on Baeldung on Computer Science.
1. Introduction

Since the recent Software-as-a-Service (SaaS) era, software development has faced a challenging decision: create generic software that enables users to define their own way of working, or develop software that guides the user toward a certain way to accomplish tasks?

On one side, we have software providing multiple generic ways for the user to do a task. On the other side, we have software providing a specialized workflow for the user to do the same task. The pros and cons of these strategies are the basis of the debate about non-opinionated and opinionated software.

In this tutorial, we’ll investigate the concepts of non-opinionated and opinionated design. First, we’ll briefly review the software design trends that led to these concepts. Then, we’ll explore the concept of non-opinionated software. Next, we’ll see how opinionated software is designed. Finally, we’ll conclude the tutorial with a systematic summary.

2. How Did We Get Here?

Before the cloud era, software solutions were typically sold as ordinary products in a store. In this scenario, we choose a product, pay for it, and use the software as we wish.

The described model, for sure, still works nowadays. However, it does present some potential pitfalls. For instance, one-time-purchase software sometimes does not receive relevant updates. So, users must keep using a deprecated solution for a long time (until they buy a license for a new version).

Moving forward in time, the rise of the cloud era brought new models for selling software. It became viable to sell software on-demand, providing both the software itself and the infrastructure to execute it. Software got very dynamic, with solutions hosted in the cloud and being used and updated everywhere at any time.

This movement became known as the “as-a-Service” model. In our particular case: Software-as-a-Service.

This model led the industry to rethink how to provide software solutions to its customers. Should a software workflow be more flexible or more efficient? Should we sell generic services or specialized ones? All these questions permeate the concepts of non-opinionated and opinionated software.

It is relevant to note that the concept of opinionated design has existed historically and fits different scenarios (for example, in the context of programming languages). But recent trends in software development have warmed up this discussion.

3. Non-opinionated Design

The main idea around non-opinionated design is that the system users have the full ability to make their own decisions. It means that, usually, the system provides several ways to accomplish a task, and the user can choose the best one according to the problem and context.

The keyword of non-opinionated solutions is flexibility. By providing multiple ways to accomplish tasks, non-opinionated solutions trust their users’ decision-making. Thus, there is no globally right way to do things, but rather a best-fitted way according to the user’s judgment and resources.

However, having great flexibility to deal with brings some challenges too. One example is spreadsheet editors. Typically, these programs enable users to create different structures to keep and process the same data. So, adopting an inefficient spreadsheet structure can make the work much more costly compared to an efficient one.

Similarly, bad decisions when developing software in a generic programming language, such as Perl or PHP, can lead to intricate workflows in the final software.

In summary, employing non-opinionated solutions means facing a road that forks many times, with all branches leading to the same place. So, the best choice is the one that best fits the available resources.

The following image intuitively presents the central idea behind the decision-making when using non-opinionated software:

4. Opinionated Design

Generally, we can see opinionated design as an oracle that indicates the right way to accomplish tasks. So, opinionated solutions provide both the resources and clear signs of how they are expected to be used.

However, this characteristic does not mean that opinionated solutions always offer a single manner (the “right” one) to do things. Actually, users can typically adopt workflows other than the advised one. But, in doing so, the user will probably face difficulties and potential problems in accomplishing tasks.

A significant challenge regarding opinionated design is evaluating whether a problem fits the solution. If the problem matches the solution’s resources and workflows, choosing opinionated software helps users solve problems fast and with outstanding quality in the final result.

Otherwise, if the problem doesn’t match the solution, the better choice is to select another opinionated solution (that fits better) or a non-opinionated one.

A famous example of opinionated solutions is wiki pages. A wiki is an opinionated design for updating content on a web page with little (or no) knowledge of HTML and CSS. Thus, if the intended webpage fits the wiki format, it can be much simpler and faster to use a wiki than to develop the entire page’s source code.

Regarding programming frameworks, in turn, an example of opinionated design is Ruby on Rails.
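As a rough illustration of the two styles (sketched in Python rather than Ruby, with invented function names), compare a non-opinionated parsing helper, where the caller decides every detail, with an opinionated wrapper that encodes conventions:

```python
import csv
import io

# Non-opinionated style: the caller must decide every detail explicitly.
def parse_report_flexible(text, delimiter, has_header):
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows[0] if has_header else None
    data = rows[1:] if has_header else rows
    return header, data

# Opinionated style: strong conventions ("comma-separated, first row is
# the header") give one obvious way to do it, trading away flexibility.
def parse_report(text):
    return parse_report_flexible(text, delimiter=",", has_header=True)

header, data = parse_report("name,score\nana,10\nbob,8")
# header == ["name", "score"]; data == [["ana", "10"], ["bob", "8"]]
```

The flexible function handles any dialect but forces every caller to make (and possibly get wrong) the same decisions; the opinionated one is trivial to use as long as the input follows its conventions.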

In short, the opinionated design is analogous to traveling a road with multiple traffic signs in a rented vehicle. Thus, following the road is very simple, but exiting it might be a disastrous decision.

The following image depicts the concept of opinionated design considering our intuitive example:

5. Systematic Summary

In the previous sections, we investigated the non-opinionated and opinionated design regarding how the solutions interact with the users.

In summary, the main difference regards flexibility. The non-opinionated design promotes different ways to accomplish the same task. The opinionated design, in turn, typically presents a “right way” to accomplish a task.

In this scenario, flexibility may be an advantage or disadvantage. The main advantage is enabling the users to make their own decisions according to their resources and context. However, making bad decisions can compromise the performance of an entire system or process.

The following table compares some characteristics and presents some examples of non-opinionated and opinionated design:


Finally, it is relevant to note the existence of a golden mean between non-opinionated and opinionated design. It can be achieved by providing high flexibility to the users while defining strong conventions to accomplish tasks. In such a scenario, users have the “right way” to do things but can tackle the particularities of their contexts and problems.

6. Conclusion

In this tutorial, we learned the concepts of non-opinionated and opinionated design. At first, we got a brief context about the recent discussions of these designs. Thus, we explored the non-opinionated design. Then, similarly, we investigated the opinionated design. Finally, we overviewed and compared the concepts in a systematic summary.

With the most recent business models in the technology and software world, the discussion about non-opinionated and opinionated software became a hot topic. Naturally, there are different advantages and disadvantages regarding each one to be considered. However, at the end of the day, adopting one or another will depend on the users’ objectives and the requirements of the tasks to be accomplished.

Rainbow Table Attacks Mon, 03 Jan 2022 09:57:35 +0000 Learn about rainbow table attacks.

The post Rainbow Table Attacks first appeared on Baeldung on Computer Science.
1. Introduction

Nowadays, there are several mechanisms intended to provide security, privacy, and authenticity in the digital world. Among these mechanisms, we can find the hash functions. Hashing is a mapping technique that transforms a given sequence of bytes of any size to another sequence of bytes of a predetermined size.

It is relevant to note that hashing has many applications. For example, hashing is suitable for storing passwords and creating digital signatures.

However, as with every mechanism that aims to provide security, hashes have particular vulnerabilities. So, attackers exploit these vulnerabilities to breach the security mechanism.

In this tutorial, we’ll study a specific attack on hashing: the rainbow table attack. Initially, we’ll have an overview of the necessary hashing concepts. Then, we’ll investigate the central idea of a rainbow table. Finally, we’ll explore the rainbow table attack itself, showing a simple example.

2. Hashing

In practice, hashing is a mapping technique. It uses a function to transform sequences of bytes into other sequences of bytes. The data provided to the hash function may have variable length. However, the data generated by the hash function (called the hash code or digest) always has the same length.

The following figure depicts the described process:

It is relevant to note that a hash function is a one-way function. So, it is not possible to recover the original plaintext input given only a hash code.
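A quick sketch with Python’s hashlib illustrates the fixed-length property:

```python
import hashlib

# Inputs of wildly different lengths...
short_digest = hashlib.sha256(b"a").hexdigest()
long_digest = hashlib.sha256(b"a" * 1_000_000).hexdigest()

# ...always yield digests of the same fixed length:
# SHA-256 outputs 256 bits, i.e., 64 hexadecimal characters
assert len(short_digest) == len(long_digest) == 64
```

No matter how large the input grows, the digest length never changes, which is exactly why collisions must eventually exist.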

A characteristic of hashing is that there exists only a finite number of hash codes, determined by the hash codes’ length.

Thus, as the input’s length is variable and unbounded, we map an infinite set onto a finite set. It means that, at some point, different plaintext inputs will generate the same hash code. This phenomenon is called a collision.

However, to add an extra security layer against precomputed attacks, it is possible to apply an operation known as salting. Salting consists of appending a random value (the salt) to the input before hashing it, so that identical inputs produce different hash codes and a single precomputed table no longer covers them.

The following image intuitively shows the salting process of hashing:
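The sketch below illustrates salting with SHA-256. Note that real systems should prefer a dedicated password-hashing function such as bcrypt or scrypt; plain SHA-256 is used here only to keep the illustration short.

```python
import hashlib
import os

def hash_password(password: str, salt: bytes = None):
    # A fresh random salt means two users with the same password end up
    # with different digests, so one precomputed table cannot cover both.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

salt_a, digest_a = hash_password("hunter2")
salt_b, digest_b = hash_password("hunter2")
assert digest_a != digest_b  # same password, different salts
```

The salt is stored alongside the digest: verification recomputes the hash with the stored salt and compares the results.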

3. Rainbow Table

Rainbow tables are pre-computed tables used to reverse hash codes back to the original input plaintext. In short, each table entry keeps a starting plaintext associated with the final code obtained after a chain of alternating hashing and reduction operations.

Rainbow tables emerged as a middle-ground solution in a time-memory trade-off: they do not require executing all the hashing/reduction operations at every request, and they require far less memory than a lookup table that directly associates every hash code with its original plaintext.

Lookup tables, in this context, are tables that directly map each hash code to its plaintext, without chains of hashing and reduction.

In summary, computing several hashes and reductions for each input at lookup time requires lots of time. On the other hand, keeping every possible hash code paired with its input requires lots of memory.

By storing only the relation between a starting plaintext and the final code of its chain, rainbow tables balance these time and memory costs.
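A toy sketch can make the chain construction concrete. The reduction function below (mapping a digest into the space of 4-digit PINs) and the chain length are invented for illustration; real tables use a carefully designed reduction that varies with the chain position, and MD5 appears here only because it is a classic rainbow table target.

```python
import hashlib

def hash_fn(plaintext: str) -> str:
    # MD5 is used purely for illustration; it is obsolete for real security
    return hashlib.md5(plaintext.encode()).hexdigest()

def reduce_fn(digest: str, position: int) -> str:
    # Toy reduction: map a digest back into the plaintext space of
    # 4-digit PINs. Varying it with the position limits chain merges.
    return f"{(int(digest, 16) + position) % 10000:04d}"

def build_chain(start: str, length: int) -> str:
    """Alternate hashing and reduction; only the endpoints are stored."""
    value = start
    for position in range(length):
        value = reduce_fn(hash_fn(value), position)
    return value

# The table keeps only (chain end -> chain start) pairs:
# the essence of the time-memory trade-off
table = {build_chain(start, 100): start for start in ["1234", "0000", "4321"]}
```

Each entry implicitly covers the 100 plaintexts along its chain, yet costs only one stored pair; recovering a plaintext later means regenerating the relevant chain from its stored start.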

The following image exemplifies a simple rainbow table:

4. Rainbow Table Attacks

In summary, a rainbow table attack compares stolen hash codes with a rainbow table of a given hashing method.

Accordingly, the very first step of a rainbow table attack is stealing a list of hashes. Commonly, these hashes correspond to private information, such as passwords and credit card numbers.

Attackers can steal a list of hashes, for example, by exploiting vulnerabilities of poorly secured databases or by using phishing techniques.

After obtaining the hashes, the attackers compare them with rainbow tables intending to find some match between them and a table entry. If there is a match, the attacker gets the plaintext of the hash.
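The matching step is easiest to see with the simpler lookup-table variant, where each hash code maps directly to a plaintext; a real rainbow table would instead regenerate chains to find the match. The password list below is a hypothetical example.

```python
import hashlib

# A tiny precomputed table: hash code -> plaintext, built from a
# hypothetical list of common passwords (MD5 for illustration only)
common_passwords = ["123456", "qwerty", "password"]
table = {hashlib.md5(p.encode()).hexdigest(): p for p in common_passwords}

# A stolen hash that matches a table entry reveals the plaintext
stolen_hash = hashlib.md5(b"qwerty").hexdigest()
recovered = table.get(stolen_hash)
# recovered == "qwerty"
```

If the stolen hash is absent from the table, the lookup simply fails, which is why long, unusual passwords are much safer against this attack.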

With the original plaintext input, the attacker can, for example, try to access accounts (if the plaintext corresponds to a password) or go shopping online (if the plaintext corresponds to credit card numbers and other information).

4.1. Example Scenarios

Usually, a rainbow table attack occurs together with other types of attacks. This is because the attacker must first obtain a list of hash codes, which may require other attack techniques.

Let’s consider a possible scenario: first, the attacker executes an SQL injection in a vulnerable web server to read the database. Next, the attacker searches for a password table and accesses its data. Finally, with the list of password hash codes, the attacker uses rainbow tables to get the plaintext data.

Another possible scenario is executing a distributed denial-of-service attack to breach the security systems (such as firewalls). Thus, the attacker can invade database systems, steal hash code lists, and execute a rainbow table attack.

4.2. Countermeasures

Besides securing the database keeping the hash codes, other countermeasures can be adopted to avoid rainbow table attacks. We describe some of these countermeasures next:

  • Improve security: a good option to prevent the consequences of a rainbow table attack from being catastrophic is not to depend only on password security. Multi-factor authentication is a great option to improve security in this context
  • Do not use easy passwords: frequent passwords, such as 0123456 and qwerty, are present in most rainbow tables. Thus, using long and unusual passwords reduces the probability of a password being mapped in a generic rainbow table
  • Use the salting technique: salting makes it much harder to execute a rainbow table attack, since each hash code depends on a random salt as well as on the password
  • Use modern hashing algorithms: obsolete hashing algorithms, such as MD5 and SHA-1, are preferred targets for rainbow table attacks – it is a good idea to avoid them

5. Conclusion

In this tutorial, we learned about rainbow table attacks. At first, we reviewed the basic concepts of hashing. Then, we focused on understanding what rainbow tables are. Next, we studied rainbow table attacks in depth. In this context, we explored concepts, scenarios, and possible countermeasures.

We can conclude that hashing is an essential technique for the current Internet. However, like any security technique, it has weaknesses that malicious attackers exploit. So, to keep our systems safe, it is essential to know these weaknesses and prepare the proper countermeasures against possible attacks.

What is Selection Bias and How Can We Prevent It? Fri, 31 Dec 2021 09:13:42 +0000 Explore the methods for preventing selection bias when we conduct statistical analysis.

The post What is Selection Bias and How Can We Prevent It? first appeared on Baeldung on Computer Science.
1. Introduction

In this tutorial, we’ll discuss the subject of selection bias in statistical sampling and the techniques for limiting it.

2. A Matter of Life and Death

2.1. Winning Wars With Statistics

Selection bias is a type of bias that’s common in statistical analysis and machine learning alike. It can lead to errors in predictive models due to distortions in the data that we use to train them.

The story that’s commonly told when introducing the problem of selection bias relates to the work of a famous mathematician, Abraham Wald. During World War II, Wald was tasked by the American Navy to study the problem of minimizing casualties in air battles.

Wald first took a list of the areas that were most frequently repaired for damage after each battle. He then determined that, counter-intuitively, the areas reported as most prone to damage were the strongest areas in the aircraft’s chassis. He therefore suggested strengthening the parts of the aircraft that did not receive maintenance after the battles.

His colleagues disagreed with him, and understandably so. After all, how could someone claim that the most frequently damaged parts of an aircraft were the most durable? If they’re damaged very often, then certainly they’re not durable at all, or are they?

2.2. The Worst Lessons Are Taught by Survivors

Wald argued that, for an aircraft to be repaired by a mechanic in the first place, the aircraft had to survive the battle and reach the hangar safely. Not all aircraft did, and those that were too severely damaged would end up destroyed. Indeed, all aircraft that took damage in the areas not listed for maintenance, such as the cockpit or the wing engines, were those that the enemy had downed:

This short story introduces us to the problem of selection bias. The procedure for selecting a statistical sample could introduce a distortion in the statistic that we compute over that sample. If this is the case, the conclusions that we draw by analyzing it will also be affected by the same bias.
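We can simulate this distortion with a toy model: if the most damaged aircraft never return, the mean damage computed over the "survivors" understates the true mean. The damage distribution and survival threshold below are arbitrary illustrative choices.

```python
import random

rng = random.Random(42)
# Hypothetical damage taken by 10,000 aircraft in battle (arbitrary units)
damage = [rng.gauss(50, 15) for _ in range(10_000)]

# Only aircraft with damage below a survival threshold return to be inspected
survivors = [d for d in damage if d < 60]

true_mean = sum(damage) / len(damage)
observed_mean = sum(survivors) / len(survivors)

# The sample of survivors understates the true damage, because the
# worst-hit aircraft never made it back into the sample
assert observed_mean < true_mean
```

The statistic itself (a simple mean) is computed correctly; the error comes entirely from how the sample was selected.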

3. All Lottery Players Are Winners

A less-famous story concerns an American man from Florida, who became famous a few years ago for winning several lottery draws in a row. In a series of interviews released by ABC News, as well as in a book that he wrote, the lottery player shared his secret recipe for success with people. “Buy lottery tickets”, he said; and if you win, then “use the lottery money all the time to buy more tickets”.

We can conceptualize the decision-making process which he advocated with a flowchart:

Indeed, this strategy, his secret recipe, worked, at least for him. It led him to acquire more than one million dollars in total profits over the course of around ten years.

But more interestingly, he isn’t the only one who won by applying this strategy. There is, in fact, a considerable number of people who applied it and who have indeed obtained large earnings.

Therefore, we could argue, it’s a good idea to invest our savings in lottery tickets. After all, plenty of observational cases suggest that people do indeed win big.

4. A Choice With Prejudice

4.1. An Implicit Exclusion

There has to be an error in this line of thought. But where exactly?

The reason why the previous strategy is unadvisable is not that it leads to sure losses: indeed, some people do become millionaires by applying it. Rather, the problem has to do with focusing on the cases where the observations satisfy our preconceptions while excluding any cases that contradict our expectations.

By considering only the words of those who successfully applied this particular strategy, we’re misled into thinking that its odds are favorable. We could therefore believe, in Bayesian terminology, that the a priori probability of success is significantly high. And of course, we’d be wrong.

4.2. The Problem Lies With Statisticians, Not Statistics

These considerations tell us that, according to the method which we use to select the observations upon which we conduct our analyses, we may get an understanding of reality that’s grossly incorrect.

The choice of examples that correspond to successful outcomes of a betting method is a kind of selection bias. More specifically, if the author of the book from the previous example received letters only from persons who applied that method successfully, rather than voluntarily discarding the complaints against his strategy, we’d be talking instead about survivorship bias.

The reason for the latter is that, while we expect many more persons to have applied the strategy than to have obtained profits from it, we can imagine that only the winners found a reason to write to the author out of gratitude. The remaining people, having already lost $10, may have decided instead to simply cut their losses and not pay for the postal stamps required to send a letter.

These considerations tell us that selection bias isn’t a problem that concerns the phenomenon that we’re studying. Rather, it depends upon the choices of the statistician or the research that they conduct. If we, personally, do statistical analysis, then we’re the ones responsible for staying alert and preventing it.

5. How to Prevent Selection Bias

5.1. Review the Literature

Now that we understand what selection bias is and how it can influence our statistical analysis, it’s time to consider what methods are available to prevent it. In some cases full prevention may be impossible: in that case, reducing it to the largest extent possible is appropriate.

The first method we can use is the application of the principle of good practice in scientific research, which implies having updated knowledge on the subject that we study. Concretely, it’s customary to begin research with a review of the literature on the phenomenon that we’re analyzing.

Let’s assume, for example, that we’re studying the level of schooling and education in a population by administering questionnaires. We should expect the government organizations in the relevant country to have compiled reports in the past concerning the same topic. These reports, together with more properly scientific sources, would constitute the prior knowledge that we bring into the study.

5.2. Consistency Between Methodology and Theory

The second method consists of the application of an observational or sampling methodology that’s appropriate for the phenomenon that we study. If we’re still interested in learning about educational achievements, and we administer written questionnaires to the population about the self-reported level of education, we’re certain to introduce selection bias to our study. In fact, each and every illiterate person is, by definition, unable to answer the written questionnaires by reason of their inability to read.

5.3. Random Sampling and Stratification

It can also happen that, for a particular population, its individuals can be grouped into clusters that are very similar to one another. In those cases, it’s appropriate to use techniques for random and stratified sampling.

The idea behind stratified sampling is that, if some populations can be divided into internally-homogeneous groups, then the statistical sample that we analyze should contain individuals that belong to those groups, in proportion to the group’s weight over the total population.
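Here is a minimal sketch of proportional stratified sampling. The grouping key and the rounding rule are simplifying assumptions; a real implementation must also reconcile rounding so that the stratum allocations sum exactly to the sample size.

```python
import random

def stratified_sample(population, strata_key, sample_size, seed=0):
    """Draw a sample whose strata proportions match the population's."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(strata_key(item), []).append(item)
    sample = []
    for group in strata.values():
        # Allocate seats proportionally to the group's share of the population
        k = round(sample_size * len(group) / len(population))
        sample.extend(rng.sample(group, k))
    return sample

# A population that is 80% urban and 20% rural...
population = [{"region": "urban"}] * 80 + [{"region": "rural"}] * 20
sample = stratified_sample(population, lambda p: p["region"], sample_size=10)
# ...yields a sample of 8 urban and 2 rural respondents
```

Within each stratum the draw is still random; stratification only guarantees that the groups are represented in the right proportions.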

5.4. Know Your Priors

And finally, as a general rule, when doing statistical analyses we should always go with the Bayesian motto: “Know thyself, and know thy priors.” If we have a clear idea of the research hypotheses that we embed in our study, and if we also have a good idea of the a priori distribution that we should expect to observe, we’re much more likely to spot selection bias in our work.

This would in fact emerge as a gross and systematic deviation from the theoretical a priori distribution, that can’t be explained by statistical fluctuation.

6. Conclusions

In this article, we’ve studied the methods for preventing selection bias when we conduct statistical analysis.
