Warning: This post has KaTeX enabled, so if you want to view the rendered math formulas, you’ll have to unfortunately enable JavaScript.
In this post, let’s dive into a topic that is very important for anyone who uses the internet: passwords. We’ll cover what the hell Entropy is, good password practices, and how it all relates to Bitcoin “seed phrases”^{1}.
Before we go into passwords, I’ll introduce the concept of Entropy.
Entropy is a measure of the amount of disorder in a system. It has its origins in Thermodynamics, where it’s used to measure the amount of energy in a system that is not available to do work.
The word “Entropy” comes from the Greek word for “transformation”.
It was given a proper statistical definition by Ludwig Boltzmann in the 1870s, while he was establishing the field of Statistical Mechanics, a branch of physics that studies the behavior of large collections of particles.
In the context of Statistical Mechanics, Entropy is a measure of the number of ways a system can be arranged. The more ways a system can be arranged, the higher its Entropy. Specifically, Entropy is a logarithmic measure of the number of system states with significant probability of being occupied:
$$S = -k \cdot \sum_i p_i \ln p_i$$
Where:

- $k$ is the Boltzmann constant
- $p_i$ is the probability of the system being in state $i$
In this formula, if all states are equally likely, i.e. $p_i = \frac{1}{N}$, where $N$ is the number of states, then the Entropy is maximized: the sum reduces to $-k \cdot N \cdot \frac{1}{N} \ln \frac{1}{N} = k \ln N$. And as $N$ grows, $k \ln N$ grows without bound: the more states a system can occupy, the higher its Entropy.
There once was a great man called Claude Shannon, who single-handedly founded the field of Information Theory, invented the concept of a Bit, and was the first to think about Boolean algebra in the context of electrical circuits. He laid the foundation for the Digital Revolution.
If you are happy using your smartphone, laptop, or any other digital device, on your high-speed fiber internet connection, through a wireless router, to send cat pictures to your friends, then you should thank Claude Shannon.
He was trying to find a formula to quantify the amount of information in a message. He wanted three things:

- the measure should be continuous in the probabilities
- if all outcomes are equally likely, the measure should increase with the number of outcomes
- if a choice is broken down into successive choices, the measure should be the weighted sum of the measures of the individual choices
He pretty much found that the formula for Entropy in Statistical Mechanics was a good measure of information. He called it Entropy to honor Boltzmann’s work. To differentiate it from the Statistical Mechanics’ Entropy, he changed the letter to $H$, in honor of Boltzmann’s $H$-theorem. So the formula for the Entropy of a message is:
$$H(X) = -\sum_{x \in X} P(x) \log P(x)$$
Where:

- $X$ is the random variable representing the message
- $P(x)$ is the probability of the outcome $x$
- the logarithm is taken base 2, so that $H$ is measured in bits
In information theory, the Entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent to the variable’s possible outcomes^{2}.
Let’s take the simple example of a fair coin. The Entropy of the random variable $X$ that represents the outcome of a fair coin flip is:
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x) = -\left(\frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2}\right) = 1 \text{ bit}$$
So the outcome of a fair coin flip has 1 bit of Entropy. This means it carries 1 bit of information, or 1 bit of uncertainty. Once the message is received, saying that the coin flip was heads or tails, the receiver has gained 1 bit of information about the outcome.
Alternatively, we only need 1 bit to encode the outcome of a fair coin flip. Hence, there’s a connection between Entropy, search space, and information.
Another good example is the outcome of a fair 6-sided die. The Entropy of the random variable $X$ that represents the outcome of a fair 6-sided die is:
$$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} \approx 2.58 \text{ bits}$$
This means that the outcome of a fair 6-sided die has 2.58 bits of Entropy. We need $\operatorname{ceil}(2.58) = 3$ bits to encode the outcome of a fair 6-sided die.
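To make this concrete, here’s a small Python sketch of the Shannon Entropy formula (the helper name is mine), checked against the coin and die examples above:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: two outcomes, each with probability 1/2
print(entropy_bits([0.5, 0.5]))   # 1.0 bit

# Fair 6-sided die: six outcomes, each with probability 1/6
print(entropy_bits([1 / 6] * 6))  # ~2.58 bits
```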
Ok now we come full circle. Let’s talk, finally, about passwords.
In the context of passwords, Entropy is a measure of how unpredictable a password is. The higher the Entropy, the harder it is to guess the password. The Entropy of a password is measured in bits, and it’s calculated using the formula:
$$H = L \cdot \log_2(N)$$
Where:

- $L$ is the length of the password, in characters
- $N$ is the number of possible characters, i.e. the size of the alphabet
For example, if we have a password with 8 characters and each character can be any of the 26 lowercase letters of the standard English alphabet, the Entropy would be:
$$H = 8 \cdot \log_2(26) \approx 37.6 \text{ bits}$$
This means that an attacker would need to try $2^{37.6} \approx 2.01 \cdot 10^{11}$ combinations^{3} to guess the password.
If the password were to include uppercase letters, numbers, and symbols (let’s assume 95 possible characters in total), the Entropy for an 8-character password would be:
$$H = 8 \cdot \log_2(95) \approx 52.6 \text{ bits}$$
This means that an attacker would need to try $2^{52.6} \approx 6.8 \cdot 10^{15}$ combinations to guess the password.
This sounds like a lot, but it isn’t that much.
For the calculations below, we’ll assume that the attacker knows your dictionary set, i.e. the set of characters you use to create your password, and the password length.
If an attacker gets hold of an NVIDIA RTX 4090, MSRP USD 1,599, which can do 300 GH/s (300,000,000,000 hashes/second), i.e. $3 \cdot 10^{11}$ hashes/second, it would take:
$$\frac{2.01 \cdot 10^{11}}{3 \cdot 10^{11}} \approx 0.67 \text{ seconds}$$
$$\frac{6.8 \cdot 10^{15}}{3 \cdot 10^{11}} \approx 22667 \text{ seconds} \approx 6.3 \text{ hours}$$
So, the first password would be cracked in less than a second, while the second would take a few hours. And this with just one USD 1.5k GPU.
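These back-of-the-envelope numbers are easy to reproduce. Here’s a small Python sketch of the password Entropy formula and the worst-case crack time (the helper names are mine; the 300 GH/s figure is the RTX 4090 rate assumed above):

```python
import math

def password_entropy_bits(length, alphabet_size):
    """H = L * log2(N)."""
    return length * math.log2(alphabet_size)

def crack_time_seconds(entropy_bits, guesses_per_second):
    """Worst-case time to exhaust the whole search space."""
    return 2 ** entropy_bits / guesses_per_second

RTX_4090 = 3e11  # ~300 GH/s

for n in (26, 95):
    h = password_entropy_bits(8, n)
    t = crack_time_seconds(h, RTX_4090)
    print(f"N={n}: {h:.1f} bits, {t:,.0f} seconds")
```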
Now that we understand Entropy and how it relates to passwords, let’s talk about bitcoin seed phrases^{1}.
Remember that our private key is a big-fucking number? If not, check my post on cryptography basics.
BIP-39 specifies how to use easy-to-remember seed phrases to store and recover private keys. The wordlist adheres to the following principles:

- smart selection of words: the first four letters of each word are enough to identify it unambiguously
- similar words are avoided, so the words are hard to confuse with each other
- the wordlist is sorted, which allows for more efficient lookups
Here is a simple 7-word seed phrase: `brave sadness grocery churn wet mammal tube`.

Surprisingly enough, this bad boy gives you $77$ bits of Entropy, while also being easy to remember. This is due to the fact that the wordlist has 2048 words, so each word gives you $\log_2(2048) = 11$ bits of Entropy^{4}.
There’s a minor caveat to cover here. The last word in the seed phrase is a checksum, which is used to verify that the phrase is valid.
So, if you have a 12-word seed phrase, you have $11 \cdot 11 = 121$ bits of Entropy. And for a 24-word seed phrase, you have $23 \cdot 11 = 253$ bits of Entropy.
The National Institute of Standards and Technology (NIST) recommends a minimum of 112 bits of Entropy for all things cryptographic. And Bitcoin has a minimum of 128 bits of Entropy.
Depending on your threat model, “Assume that your adversary is capable of a trillion guesses per second”, here’s how long it would take to crack a 121-bit Entropy seed phrase:
$$\frac{2^{121}}{10^{12}} \approx 2.66 \cdot 10^{24} \text{ seconds} \approx 3.08 \cdot 10^{19} \text{ days} \approx 8.43 \cdot 10^{16} \text{ years}$$
That’s a lot of years. Now for a 253-bit Entropy seed phrase:
$$\frac{2^{253}}{10^{12}} \approx 1.45 \cdot 10^{64} \text{ seconds} \approx 1.68 \cdot 10^{59} \text{ days} \approx 4.59 \cdot 10^{56} \text{ years}$$
That’s another huge number of years.
You can also use a seed phrase as a password. The bonus is that you don’t need the last word to be a checksum, so you get 11 bits of Entropy for free compared to a Bitcoin seed phrase.
Remember the 7-word bad boy seed phrase we generated earlier? `brave sadness grocery churn wet mammal tube`. It has $77$ bits of Entropy. This would take, assuming “that your adversary is capable of a trillion guesses per second”:
$$\frac{2^{77}}{10^{12}} \approx 1.51 \cdot 10^{11} \text{ seconds} \approx 1.75 \cdot 10^{6} \text{ days} \approx 4.79 \cdot 10^{3} \text{ years}$$
That’s why tons of people use seed phrases as passwords. Even if you know the dictionary set and the length of the password, i.e. the number of words in the seed phrase, it would take a lot of years to crack it.
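Here’s a small Python sketch of the seed-phrase arithmetic above (the helper names are mine), using the 2048-word BIP-39 list size and the “trillion guesses per second” adversary:

```python
import math

def seed_phrase_entropy_bits(n_words, wordlist_size=2048):
    """Each word contributes log2(wordlist_size) bits of Entropy."""
    return n_words * math.log2(wordlist_size)

def years_to_crack(bits, guesses_per_second=1e12):
    """Worst-case: exhaust the whole search space."""
    return 2 ** bits / guesses_per_second / (60 * 60 * 24 * 365)

print(seed_phrase_entropy_bits(7))        # 77.0 bits
print(round(years_to_crack(77)))          # a few thousand years
```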
Entropy is a measure of the amount of disorder in a system. In the context of passwords, it’s a measure of how unpredictable a password is. The higher the Entropy, the harder it is to guess the password.
Bitcoin seed phrases are a great way to store and recover private keys. They are easy to remember and have a high amount of Entropy. You can even use a seed phrase as a password.
Even if your attacker is capable of a trillion guesses per second, like the NSA, it would take them a lot of years to crack even a 7-word seed phrase.
If you want to generate a seed phrase, you can use KeePassXC, which is a great open-source offline password manager that supports seed phrases^{5}.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
seed phrases are technically called “mnemonic phrases”, but I’ll use the term “seed phrases” for the rest of the post. ↩︎ ↩︎
there is a Bayesian argument about the use of priors that should adhere to the Principle of Maximal Entropy ↩︎
technically, we need to divide the number of combinations by 2: a brute-force attacker tries the possible combinations one by one, and on average finds the password halfway through the search space. This assumes that the password is uniformly distributed in the search space. ↩︎
remember that $2^{11} = 2048$. ↩︎
technically, KeePassXC uses the EFF wordlist, which has 7,776 words, so each word gives you $\log_2(7776) \approx 12.9$ bits of Entropy. They were created to be easy to use with 6-sided dice. ↩︎
Euclid’s one-way function
Warning: This post has KaTeX enabled, so if you want to view the rendered math formulas, you’ll have to unfortunately enable JavaScript.
This is the companion post to the cryptography workshop that I gave at a local BitDevs. Let’s explore the basics of cryptography. We’ll go through the following topics:

- one-way functions and cryptographic hash functions
- finite fields and the discrete logarithm problem
- public-key cryptography
- digital signatures: DSA and Schnorr
A one-way function is a function that is easy to compute on every input, but hard to invert given the image^{1} of a random input. For example, imagine an omelet. It’s easy to make an omelet from eggs, but it’s hard to make eggs from an omelet. In a sense, we can say that the function $\text{omelet}$ is a one-way function:
$$\text{omelet}^{-1}(x) = \ldots$$
That is, we don’t know how to invert the function $\text{omelet}$ to get the original eggs back. Or, even better, the benefit we get from reverting the omelet to eggs is not worth the effort, either in time or money.
Not all functions are one-way functions. The exponential function, $f(x) = e^x$, is not a one-way function. It is easy to undo the exponential function by taking the natural logarithm,
$$f^{-1}(x) = \ln(x)$$
To showcase one-way functions, let’s take a look at the following example. Let’s play around with some numbers. Not any kind of numbers, but very special numbers called primes. A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself.
If I give you a big number $n$ and ask you to find its prime factors, and point a gun at your head, you’re pretty much screwed. There’s no known efficient algorithm^{2} to factorize a big number into its prime factors. You’ll be forced to test all numbers from 2 to $\sqrt{n}$ to see if they divide $n$.
Here’s a number:
$$90809$$
What are its prime factors? It’s $1279 \cdot 71$. Easy to check, right? Hard to find. That’s because prime factorization, if you choose a fucking big number, is a one-way function.
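Here’s a minimal Python sketch of the asymmetry: trial division has to grind through candidate divisors one by one, while checking a claimed factorization is a single multiplication.

```python
def trial_division(n):
    """Factor n by testing divisors up to sqrt(n): easy to verify, slow to find."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is prime
    return factors

print(trial_division(90809))  # [71, 1279]
print(71 * 1279)              # 90809: checking is instant
```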
Let’s spice things up. There is a special class of one-way functions called hash functions.
A hash function is any function that can be used to map data of arbitrary size to fixed-size values.
But we are most interested in cryptographic hash functions, which are hash functions that have statistical properties desirable for cryptographic applications:

- pre-image resistance: given a hash value, it is hard to find a message that hashes to it
- second pre-image resistance: given a message, it is hard to find a different message with the same hash
- collision resistance: it is hard to find any two different messages with the same hash
- avalanche effect: a small change in the message changes the hash completely
These properties enable cryptographic hash functions to be used in a wide range of applications, including but not limited to:
Digital signatures: Hash functions are used to create a digest of the message to be signed. The digital signature is then generated using the hash, rather than the message itself, to ensure integrity and non-repudiation.
Password hashing: Storing passwords as hash values instead of plain text. Even if the hash values are exposed, the original passwords remain secure due to the pre-image resistance property.
Blockchain and cryptocurrency: Hash functions are used to maintain the integrity of the blockchain. Each block contains the hash of the previous block, creating a secure link. Cryptographic hashes also underpin various aspects of cryptocurrency transactions.
Data integrity verification: Hash functions are used to ensure that files, messages, or data blocks have not been altered. By comparing hash values computed before and after transmission or storage, any changes in the data can be detected.
We’ll cover just the digital signatures part in this post.
The Secure Hash Algorithm 2 (SHA-2) is a set of cryptographic hash functions designed by the National Security Agency (NSA). It was first published in 2001.
It is composed of six hash functions with digests that are 224, 256, 384, 512, 512/224, and 512/256 bits long:

- SHA-224
- SHA-256
- SHA-384
- SHA-512
- SHA-512/224
- SHA-512/256
Amongst these, let’s focus on SHA-256, which is the most widely used, while also being famously adopted by Bitcoin.
SHA-256 does not have any known vulnerabilities and is considered secure. It operates on 32-bit words and processes the message in 64-byte (512-bit) blocks. The algorithm does 64 rounds of the following operations:
- `AND`: bitwise boolean AND
- `XOR`: bitwise boolean XOR
- `OR`: bitwise boolean OR
- `ROT`: right rotation bit shift
- `ADD`: addition modulo $2^{32}$

You can check the SHA-256 pseudocode on Wikipedia. It really scrambles the input message in a way that is very hard to reverse.
These operations are non-linear and very difficult to keep track of. In other words, you can’t reverse-engineer the hash to find the original message. There’s no “autodiff” for hash functions.
Since it is a cryptographic hash function, if we change just one bit of the input, the output will be completely different. Check this example:
```shell
$ echo "The quick brown fox jumps over the lazy dog" | shasum -a 256
c03905fcdab297513a620ec81ed46ca44ddb62d41cbbd83eb4a5a3592be26a69  -
$ echo "The quick brown fox jumps over the lazy dog." | shasum -a 256
b47cc0f104b62d4c7c30bcd68fd8e67613e287dc4ad8c310ef10cbadea9c4380  -
```
Here we are only adding a period at the end of the sentence, and the hash is completely different. This is the avalanche effect in action: a tiny change in the input completely changes the output.
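You can reproduce this avalanche behavior in Python with `hashlib`. Note that the digests differ from the shell output above, because `echo` appends a trailing newline to its input:

```python
import hashlib

def sha256_hex(msg: str) -> str:
    return hashlib.sha256(msg.encode()).hexdigest()

a = sha256_hex("The quick brown fox jumps over the lazy dog")
b = sha256_hex("The quick brown fox jumps over the lazy dog.")

# Count how many of the 256 bits differ between the two digests
diff = bin(int(a, 16) ^ int(b, 16)).count("1")
print(a)
print(b)
print(diff)  # roughly half of the 256 bits flip
```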
Before we dive into public-key cryptography, we need a brief interlude on fields.
Fields are sets with two binary operations, called addition $+$ and multiplication $\times$. We write
$$F = (F, +, \times)$$
to denote a field, where $F$ is the set, $+$ is the addition operation, and $\times$ is the multiplication operation.
Addition and multiplication behave similarly to the addition and multiplication of real numbers. For example, addition is commutative and associative
$$a + b = b + a,$$
and multiplication is distributive
$$a \times (b + c) = a \times b + a \times c.$$
Also, every element $a$ has an additive inverse $-a$, and every nonzero element has a multiplicative inverse $a^{-1}$, such that

$$a + (-a) = 0,$$

and

$$a \times a^{-1} = 1,$$

where $0$ and $1$ are the additive and multiplicative identity elements, respectively.
Note that this allows us to define subtraction
$$a - b = a + (-b),$$
and division
$$a \div b = a \times b^{-1}.$$
Now we are ready for finite fields. A finite field, also called a Galois field (in honor of Évariste Galois), is a field with a finite number of elements. As with any field, a finite field is a set on which the operations of multiplication, addition, subtraction and division are defined and satisfy the rules above for fields.
Finite fields are a very rich topic in mathematics, and there are many ways to construct them. The easiest way is to take the integers modulo a prime number $p$. For example, $\mathbb{Z}_5$ is a finite field with 5 elements:
$$\mathbb{Z}_5 = \lbrace 0, 1, 2, 3, 4 \rbrace.$$
In general, $\mathbb{Z}_n$ is the set of integers modulo $n$:
$$\mathbb{Z}_n = \lbrace 0, 1, 2, \ldots, n - 1 \rbrace.$$
The number of elements in a finite field is called the order of the field. The order of a finite field is always a prime power; for the fields $\mathbb{Z}_p$ we work with here, the order is a prime number $p$. The $\mathbb{Z}_5$ example above is a finite field of order 5. However, $\mathbb{Z}_4$ is not a finite field, because 4 is not a prime number, but rather a composite number:
$$4 = 2 \times 2.$$
This composite structure breaks the field: in $\mathbb{Z}_4$ we have

$$2 \times 2 = 4 \equiv 0 \pmod 4,$$

so $2$ is a zero divisor, and no element $b$ satisfies $2 \times b \equiv 1 \pmod 4$. That is, $2$ has no multiplicative inverse in $\mathbb{Z}_4$, division by $2$ is undefined, and $\mathbb{Z}_4$ is not a field.

In general, if $n$ is a composite number, then $\mathbb{Z}_n$ is not a finite field: every nontrivial factor of $n$ is a zero divisor without a multiplicative inverse.
Addition in finite fields is defined as the remainder of the sum of two elements modulo the order of the field.
For example, in $\mathbb{Z}_3$,
$$1 + 2 = 3 \mod 3 = 0.$$
We can also define subtraction in finite fields as the remainder of the difference of two elements modulo the order of the field.
For example, in $\mathbb{Z}_3$,
$$1 - 2 = -1 \mod 3 = 2.$$
Multiplication in finite fields can be written as multiple additions. For example, in $\mathbb{Z}_3$,
$$2 \times 2 = 2 + 2 = 4 \mod 3 = 1.$$
Exponentiation in finite fields can be written as multiple multiplications. For example, in $\mathbb{Z}_3$,
$$2^2 = 2 \times 2 = 4 \mod 3 = 1.$$
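These toy calculations are easy to check directly in Python:

```python
p = 3  # work in the finite field Z_3

print((1 + 2) % p)   # addition: 0
print((1 - 2) % p)   # subtraction: 2
print((2 * 2) % p)   # multiplication: 1
print(pow(2, 2, p))  # exponentiation: 1
```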
As you can see, addition, subtraction, and multiplication become simple, linear operations. They are easy to compute in any finite field.
Exponentiation is also easy to compute. However, it is really hard to undo. Suppose that we have numbers $g, c$ in a very large finite field $\mathbb{Z}_p$, such that

$$c = g^x \mod p.$$

Computing $c$ from $g$ and $x$ is fast. But recovering the exponent,

$$x = \log_g c \mod p,$$

is called the discrete logarithm problem. Since this number is a discrete number and not a real number, that’s why it’s called the *discrete* logarithm problem. (Division, by the way, is not the hard part: the multiplicative inverse $b^{-1}$ can be computed efficiently with the extended Euclidean algorithm. The hardness lives in the exponent.)

Good luck my friend, no efficient method is known for computing discrete logarithms in general. You can try brute force, but that’s not efficient.
To get a feeling for why the discrete logarithm problem is difficult, let’s add one more concept to our bag of knowledge. Every finite field has generators, also known as primitive roots: a generator is an element of the field such that taking its successive powers generates every nonzero element of the field.
Let’s illustrate this with an example. Below we have a table of all the results of the following operation
$$b^x \mod 7$$
for every possible value of $x$. As you have rightly guessed, this is the $\mathbb{Z}_7$ finite field.
$b$ | $b^1 \mod 7$ | $b^2 \mod 7$ | $b^3 \mod 7$ | $b^4 \mod 7$ | $b^5 \mod 7$ | $b^6 \mod 7$ |
---|---|---|---|---|---|---|
$1$ | $1$ | $1$ | $1$ | $1$ | $1$ | $1$ |
$2$ | $2$ | $4$ | $1$ | $2$ | $4$ | $1$ |
$3$ | $3$ | $2$ | $6$ | $4$ | $5$ | $1$ |
$4$ | $4$ | $2$ | $1$ | $4$ | $2$ | $1$ |
$5$ | $5$ | $4$ | $6$ | $2$ | $3$ | $1$ |
$6$ | $6$ | $1$ | $6$ | $1$ | $1$ | $1$ |
You see that something interesting is happening here. For specific values of $b$, such as $b = 3$ and $b = 5$, we are able to generate the whole finite field. Hence, we say that $3$ and $5$ are generators or primitive roots of $\mathbb{Z}_7$.
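A small Python sketch can recover the generators of $\mathbb{Z}_7$ from the same idea, by checking whether the powers of $b$ hit every nonzero element (the helper name is mine):

```python
def is_generator(b, p):
    """b generates the nonzero elements of Z_p if its powers hit all of them."""
    return {pow(b, x, p) for x in range(1, p)} == set(range(1, p))

# Recover the generators of Z_7 from the table above
print([b for b in range(1, 7) if is_generator(b, 7)])  # [3, 5]
```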
Now suppose I ask you to find $x$ in the following equation
$$3^x \mod p = 11$$
where $p$ is a very large prime number. Then you don’t have any other option than brute forcing it. You’ll need to try each exponent $x \in \mathbb{Z}_p$ until you find the one that satisfies the equation.
Notice that this operation is very asymmetric. It is very easy to compute $3^x \mod p$ for any $x$, but it is very hard to find $x$ given $3^x \mod p$.
Now we are ready to dive into public-key cryptography.
Let’s illustrate the discrete logarithm problem with a numerical example.
The discrete logarithm problem is to find $x$ given $g^x \mod p$. So let’s plug in the numbers; find $x$ in
$$3^x = 15 \mod 17 $$
Try to find it. Good luck^{5}.
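If you get tired of guessing by hand, a brute-force search, which is essentially the only general option, settles it quickly for such a tiny modulus:

```python
def brute_force_dlog(g, target, p):
    """Try every exponent until g^x mod p hits the target."""
    for x in range(1, p):
        if pow(g, x, p) == target:
            return x
    return None

# Solve 3^x = 15 mod 17 by exhaustive search
print(brute_force_dlog(3, 15, 17))
```

For a 17-element field this finishes instantly; for a field the size of the ones used in real cryptography, the same loop would outlast the universe.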
Public-key cryptography, or asymmetric cryptography, is a cryptographic system that uses pairs of keys: private and public. The public key you can share with anyone, but the private key you must keep secret. The keys are related mathematically, but it is computationally infeasible to derive the private key from the public key. In other words, the public key is a one-way function of the private key.
Before we dive into the details of public-key cryptography, and signing and verifying messages, let me introduce some notation:

- $p$: a very large prime number
- $g$: a generator of $\mathbb{Z}_p$
- $S_k$: the secret (private) key, a random number in $\mathbb{Z}_p$
- $P_k = g^{S_k} \mod p$: the public key
- $H$: a cryptographic hash function
- $m$: the message
If you know $S_k$ and $g$ (which is almost always part of the spec), then it’s easy to derive $P_k$. However, if you only know $g$ and $P_k$, good luck finding $S_k$. It’s the discrete log problem again. And as long as $p$ is HUGE, you can be pretty confident that no one will find your secret key from your public key.
Now, what can we do with these keys and big prime numbers? Well, we can sign a message with our secret key, and everyone can verify the authenticity of the message using our public key. The message, in our case, is commonly a hash of the “original message”. Due to the collision resistance property, we can definitely assert that:

- the message was not altered (integrity)
- the message was signed by the owner of the private key (authenticity and non-repudiation)
Fun fact, I once gave a recommendation letter to a very bright student, that was only a plain text file signed with my private key. I could rest assured that the letter was not altered, and the student and other people could verify that I was the author of the letter.
Next, we’ll dive into the details of the Digital Signature Algorithm (DSA) and the Schnorr signature algorithm.
DSA stands for Digital Signature Algorithm. It was first proposed by the National Institute of Standards and Technology (NIST) in 1991. Note that OpenSSH announced that DSA is scheduled for removal in 2025.
Here’s how you can sign a message using DSA (in a simplified form, without the separate subgroup order $q$):

1. Generate a random nonce $k$.
2. Compute $K = g^k \mod p$.
3. Compute $s = k^{-1} (H(m) + S_k K) \mod p$.
4. The signature is the pair $(K, s)$.
And here’s how you can verify the signature:

1. Compute $u_1 = H(m) \cdot s^{-1} \mod p$.
2. Compute $u_2 = K \cdot s^{-1} \mod p$.
3. Accept the signature if $g^{u_1} P_k^{u_2} \mod p = K$.
How does this work? Let’s go through a proof of correctness. I added some comments to every operation in parentheses to make it easier to follow.
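Here is a sketch of the correctness argument, assuming the simplified signing equation $s = k^{-1} (H(m) + S_k K) \bmod p$ with $K = g^k$ (the same relation used in the nonce-reuse section later in the post):

$$\begin{aligned} g^{H(m) s^{-1}} \cdot P_k^{K s^{-1}} &= g^{H(m) s^{-1}} \cdot g^{S_k K s^{-1}} && (P_k = g^{S_k}) \\ &= g^{(H(m) + S_k K) s^{-1}} && \text{(collect the exponents)} \\ &= g^{k} && (s = k^{-1}(H(m) + S_k K) \text{ implies } (H(m) + S_k K) s^{-1} = k) \\ &= K && (K = g^k) \end{aligned}$$

So the verifier, using only public values, recomputes $K$ and accepts the signature.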
There you go. This attests that the signature is correct and the message was signed by the owner of the private key.
The Schnorr signature algorithm is very similar to DSA. It was proposed by Claus-Peter Schnorr in 1989. It is considered to be more secure than DSA and is also more efficient. The patent for Schnorr signatures expired in 2008, just in time for Satoshi to include it in Bitcoin. However, it was probably not included due to the fact that there weren’t good battle-tested software implementations of it at the time. It was eventually added to Bitcoin in the Taproot upgrade^{6}.
Schnorr is a marvelous algorithm. It is so much simpler than DSA. Here’s how you sign a message using Schnorr:

1. Generate a random nonce $k$.
2. Compute $K = g^k \mod p$.
3. Compute the challenge $e = H(K \| m)$.
4. Compute $s = k - S_k \cdot e$.
5. The signature is the pair $(s, e)$.
And here’s how you can verify the signature:

1. Compute $K_v = g^{s} P_k^{e} \mod p$.
2. Accept the signature if $H(K_v \| m) = e$.
How does this work? Let’s go through a proof of correctness. As before, I added some comments to every operation in parentheses to make it easier to follow.
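Here is a sketch of the correctness argument, assuming the signing equation $s = k - S_k \cdot e$ with challenge $e = H(K \| m)$ (consistent with the nonce-reuse algebra below):

$$\begin{aligned} g^{s} \cdot P_k^{e} &= g^{k - S_k e} \cdot g^{S_k e} && (P_k = g^{S_k}) \\ &= g^{k} && \text{(the $S_k e$ terms cancel)} \\ &= K && (K = g^k) \end{aligned}$$

The verifier recomputes $K$ from public values and checks that $H(K \| m) = e$.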
There you go. This attests that the signature is correct and the message was signed by the owner of the private key.
Never, ever, reuse a nonce. Why? First, because nonce is short for “number used once”: it is supposed to be used only once. Second, because if you reuse a nonce, you are pretty much screwed: an attacker can derive your private key from two signatures that share the same nonce. This is called the “nonce reuse attack”.
Fun fact: this is what happened to the PlayStation 3.
Let’s see how we can derive the private key from two signatures with the same nonce. The context: we have two signatures, $s$ and $s^\prime$, both using the same nonce $k = k^\prime$.
First, let’s do the ugly DSA math:
$$\begin{aligned} s^\prime - s &= k^{\prime -1} (H(m_1) + S_k K^\prime) - k^{-1} (H(m_2) + S_k K) \\ s^\prime - s &= k^{-1} (H(m_1) - H(m_2)) \\ k &= (H(m_1) - H(m_2)) (s^\prime - s)^{-1} \end{aligned}$$
Now remember, you know $s$, $s^\prime$, $H(m_1)$, $H(m_2)$, $K$, and $K^\prime$. Let’s do the final step and solve for $S_k$:
$$S_k = K^{-1} (k s - H(m_2))$$
Now let’s do the Schnorr math. But in Schnorr, everything is simpler. Even nonce reuse attacks.
$$s^\prime - s = (k^\prime - k) - S_k (e^\prime - e)$$
If $k^\prime = k$ (nonce reuse) then you can easily isolate $S_k$ with simple algebra.
Remember: you know $s^\prime$, $s$, $e$, $e^\prime$, and that $k^\prime - k = 0$.
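Here’s a toy numerical sketch of the attack, using the Schnorr relation $s = k - S_k \cdot e$ over a small prime-order group of exponents. All parameters are made up and far too small for real use:

```python
q = 101                # prime order of the exponent group (toy value)
S_k = 42               # the secret key
k = 77                 # the nonce, fatally used twice

e1, e2 = 13, 57        # two different challenges e = H(K || m)
s1 = (k - S_k * e1) % q
s2 = (k - S_k * e2) % q

# s1 - s2 = -S_k (e1 - e2)  =>  S_k = (s2 - s1) * (e1 - e2)^{-1} mod q
recovered = (s2 - s1) * pow(e1 - e2, -1, q) % q
print(recovered)       # 42: the secret key falls right out
```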
In Bitcoin, we can combine Schnorr signatures, but not DSA signatures. Why? Because Schnorr signatures are linear. This means that you can add two Schnorr signatures and get a valid signature for the sum of the messages. This is not possible with DSA. This is called the “linearity property” of Schnorr signatures.
Remember that in $\mathbb{Z}_p$ addition, subtraction, and multiplication, i.e. anything with $+$, $-$, or $\cdot$, are linear operations. However, division (the modular inverse), i.e. anything with $^{-1}$, is not linear. That is:
$$x^{-1} + y^{-1} \neq (x + y)^{-1}.$$
Here’s a trivial Python snippet that shows that the modular inverse is not linear:

```python
>>> p = 71; x = 13; y = 17
>>> pow(x, -1, p) + pow(y, -1, p) == pow(x + y, -1, p)
False
```
Let’s revisit the signature step of DSA and Schnorr:

- DSA: $s = k^{-1} (H(m) + S_k K)$, which contains a modular inverse $k^{-1}$, a non-linear operation.
- Schnorr: $s = k - S_k \cdot e$, which only uses subtraction and multiplication, both linear operations.
So if you have two Schnorr signatures $s_1$ and $s_2$ for two messages $m_1$ and $m_2$, then you can easily compute a valid signature for the sum of the messages $m_1 + m_2$:
$$s = s_1 + s_2$$
Also note that we can combine Schnorr public keys. Writing the group multiplicatively:

$$P^\prime_k \cdot P_k = g^{S^\prime_k} \cdot g^{S_k} = g^{S_k^\prime + S_k}$$

And the signature $s$ for the sum of the messages $m_1 + m_2$ can be verified with the combined public key $P^\prime_k \cdot P_k$.
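Here’s a toy Python sketch of this linearity, using a tiny Schnorr-style group (all parameters are made up and far too small for real use): two signatures under a shared challenge combine into one signature that verifies against the combined public key.

```python
p = 23; g = 2; q = 11   # toy group: g = 2 has prime order 11 mod 23

a, b = 3, 7                           # two secret keys
Pa, Pb = pow(g, a, p), pow(g, b, p)   # their public keys

ka, kb = 5, 9                         # per-signer nonces
Ka, Kb = pow(g, ka, p), pow(g, kb, p)

e = 4                   # a shared challenge (in practice H(Ka*Kb || m))
sa = (ka - a * e) % q
sb = (kb - b * e) % q

# Signatures, nonce commitments, and keys all combine linearly:
s = (sa + sb) % q
K = Ka * Kb % p
P = Pa * Pb % p

# Verify the aggregate: g^s * P^e should recompute the combined nonce K
print(pow(g, s, p) * pow(P, e, p) % p == K)  # True
```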
This is not possible with DSA, because the signature step in DSA is not linear: it has a $k^{-1}$ in it.
Technically speaking, Bitcoin uses the Elliptic Curve Digital Signature Algorithm (ECDSA), and the Schnorr signature algorithm is based on the same elliptic curve (EC) as ECDSA.
And, loosely speaking, EC public-key cryptography in the end works just like the arithmetic over $\mathbb{Z}_p$ that we’ve seen. It has everything that we’ve covered so far:

- modular arithmetic over a finite structure
- a generator (the base point)
- a hard discrete logarithm problem protecting the private key
I hope you enjoyed this companion post to the cryptography workshop. Remember don’t reuse nonces.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
the image of a function $f$ is the set of all values that $f$ may produce. ↩︎
the problem of factoring a number into its prime factors is not known to be solvable in polynomial time (the class P), but it is also not known to be NP-complete. Incidentally, finding out whether P equals NP is the hardest way to earn a million dollars: the P vs NP problem. ↩︎
this is called surjection. ↩︎
at least $\frac{1}{N}$ where $N$ is the size of $Y$. ↩︎
The answer is $x = 6$. This means that $3^6 = 15 \mod 17$. ↩︎
Taproot is a proposed Bitcoin protocol upgrade that was deployed as a forward-compatible soft fork. The validation of Taproot is based on Schnorr signatures. You can find more in BIPS 340, 341, and 342. ↩︎
It all started when I had to accompany my mom to the hospital. It was just a routine checkup, but I had to wait for a few hours. I brought my laptop with me, since they have good WiFi and I could work on my projects. Then I realized that my mom was playing a Sudoku^{1} game on her phone. I couldn’t help but notice that the game was full of ads and it was asking for a lot of permissions, like location and sensor data. So I decided to make a Sudoku game for her, without ads or any permissions. It wouldn’t even need to ask for the blessing of Google or Tim Apple, since it was a Progressive Web App (PWA) and it would work offline.
You can play the game at storopoli.io/sudoku or check the source code at storopoli/sudoku.
Here’s a screenshot of the game:
So what would I use to build this game? Only one thing: Dioxus. Dioxus is a fullstack framework for Rust that allows you to build web applications entirely in Rust. You benefit from the safety and performance of Rust, its powerful type system and borrow checker, along with a low memory footprint.
That’s it. Just Rust and HTML with some raw CSS.
No “YavaScript”. No Node.js. No npm. No webpack. No Tailwind CSS.
Just `cargo run --release` and you’re done.
Using Rust for fullstack development is an amazing thing.
First, package management is a breeze with Cargo.
Second, you don’t have to worry about “npm vulnerabilities”.
Have you ever gone into your project and run `npm audit`?
This is solvable with Rust.
An additional advantage is that you don’t have to worry about common runtime errors like `undefined is not a function` or `null is not an object`.
These are all picked up by Rust at compile time.
So you can focus on the logic of your application knowing that it will work as expected.
A common workflow in Rust fullstack applications is to use Rust’s powerful type system to parse any user input into a type that you can trust, and then propagate that type throughout your application. This way you can be sure that you’re not going to have any runtime errors due to invalid input. This is not the case with “YavaScript”. You need to validate the input at every step of the way, and you can’t be sure that the input is valid at any point in time.
You can sleep soundly at night knowing that your application won’t crash and as long as the host machine has electricity and internet access, your app is working as expected^{2}.
Rust is known for its performance. This is due to the fact that Rust gives you control over deciding on which type you’ll use for a variable. This is not the case with “YavaScript”, where you can’t decide if a variable is a number or a string. Also you can use references and lifetimes to avoid copying data around.
So, if you make sane decisions, like `u8` (unsigned 8-bit integer) instead of `i32` (signed 32-bit integer) for a number that will never be greater than 255, you can have a very low memory footprint. Also, you can use `&str` (string slice) instead of `String` to avoid copying strings around.
You just don’t have this level of control with “YavaScript”. You get either strings or numbers and you can’t decide on the size of the number. And all of your strings will be heap-allocated and copied around.
Progressive Web Apps (PWAs) are web applications that are regular web pages or websites, but can appear to the user like traditional applications or native mobile applications. Since they use the device’s browser, they don’t need to be installed through an app store. This is a great advantage, since you don’t have to ask for permissions to Google or Tim Apple.
In Dioxus making a PWA was really easy.
There is a PWA template in the `examples/` directory in their repository. You just have to follow the instructions in the README and you’re done. In my case, I only had to change the metadata in the `manifest.json` file and add what I wanted to cache in the service worker `.js` file. These were only the favicon icon and the CSS style file.
I didn’t have to worry about the algorithm to generate the Sudoku board. This was already implemented in the `sudoku` crate.
But I had to implement some Sudoku logic to make the user interface work.
Some things that I had to implement were:

- getting all the cells related to a given cell (same row, column, and 3x3 sub-grid)
- finding the cells that conflict with a given cell
- finding all conflicting cells in the whole board
This was a simple task, yet it was very fun to implement.
To get the related cells, you need to find the row and column of the cell. Then you can find the start row and start column of the 3x3 sub-grid. After that, you can add the cells in the same row, column and sub-grid to a vector. Finally, you can remove the duplicates and the original cell from the vector.
Here’s the code:
pub fn get_related_cells(index: u8) -> Vec<u8> {
    let mut related_cells = Vec::new();
    let row = index / 9;
    let col = index % 9;
    let start_row = row / 3 * 3;
    let start_col = col / 3 * 3;

    // Add cells in the same row
    for i in 0..9 {
        related_cells.push(row * 9 + i);
    }
    // Add cells in the same column
    for i in 0..9 {
        related_cells.push(i * 9 + col);
    }
    // Add cells in the same 3x3 sub-grid
    for i in start_row..start_row + 3 {
        for j in start_col..start_col + 3 {
            related_cells.push(i * 9 + j);
        }
    }

    // Remove duplicates and the original cell
    related_cells.sort_unstable();
    related_cells.dedup();
    related_cells.retain(|&x| x != index);
    related_cells
}
To find the conflicting cells, you need to get the value of the target cell. Then you can get the related cells and filter the ones that have the same value as the target cell. Easy peasy.
Here’s the code:
pub fn get_conflicting_cells(board: &SudokuState, index: u8) -> Vec<u8> {
    // Get the value of the target cell
    let value = board[index as usize];
    // Ignore if the target cell is empty (value 0)
    if value == 0 {
        return Vec::new();
    }
    // Get related cells
    let related_cells = get_related_cells(index);
    // Find cells that have the same value as the target cell
    related_cells
        .into_iter()
        .filter(|&index| board[index as usize] == value)
        .collect()
}
Note that I am using `0` to represent empty cells.
But if the user ignores the conflicting cells and adds a number to the board, there will be more conflicting cells than the ones related to the target cell. This is handled by another helper function.
Here’s the code, and I took the liberty of adding the docstrings (the `///` comments that render as documentation):
/// Get all the conflicting cells for all filled cells in a Sudoku board
///
/// ## Parameters
///
/// - `current_sudoku: &SudokuState` - A reference to the current [`SudokuState`]
///
/// ## Returns
///
/// Returns a `Vec<u8>` with the indices of all cells that are conflicting
/// in the current Sudoku board.
pub fn get_all_conflicting_cells(current_sudoku: &SudokuState) -> Vec<u8> {
    let filled: Vec<u8> = current_sudoku
        .iter()
        .enumerate()
        .filter_map(|(idx, &value)| {
            if value != 0 {
                u8::try_from(idx).ok()
            } else {
                None // Filter out the item if the value is 0
            }
        })
        .collect();

    // Get all conflicting cells for the filled cells
    let mut conflicting: Vec<u8> = filled
        .iter()
        .flat_map(|&v| get_conflicting_cells(current_sudoku, v))
        .collect();

    // Retain unique
    conflicting.sort_unstable();
    conflicting.dedup();
    conflicting
}
The trick here is that we are using a `flat_map`, since a naive `map` would return a nested `Vec<Vec<u8>>`, and we don’t want that. We want a flat `Vec<u8>` of all conflicting cells. Nesting is always tricky, go ask Alan Turing.
As you can see, I used a `SudokuState` type to represent the state of the game. This is just a type alias for a `[u8; 81]` array, which is a very simple and efficient way to represent the state of the game.
Here’s the code:
pub type SudokuState = [u8; 81];
The Sudoku app also has an undo button. This is implemented by using a `Vec<SudokuState>` to store the history of the game. Every time the user adds a number to the board, the updated state is pushed to the history vector. When the user clicks the undo button, the last state is popped from the history vector and the board is updated.
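The history mechanism above can be sketched like this (a minimal, self-contained illustration; the `Game` struct and method names are mine, not the app’s actual API):

```rust
pub type SudokuState = [u8; 81];

/// Illustrative sketch of an undo stack for the board state.
struct Game {
    board: SudokuState,
    history: Vec<SudokuState>,
}

impl Game {
    fn set_cell(&mut self, index: usize, value: u8) {
        // Save the current state before mutating, so it can be undone.
        // SudokuState is a plain [u8; 81], so this is a cheap copy.
        self.history.push(self.board);
        self.board[index] = value;
    }

    fn undo(&mut self) {
        // Restore the last saved state, if any.
        if let Some(previous) = self.history.pop() {
            self.board = previous;
        }
    }
}

fn main() {
    let mut game = Game { board: [0; 81], history: Vec::new() };
    game.set_cell(42, 7);
    assert_eq!(game.board[42], 7);
    game.undo();
    assert_eq!(game.board[42], 0);
}
```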
There’s one additional problem with the undo button.
It needs to switch the clicked cell to the one that was clicked before.
Yet another simple, but fun, task.
First you need to find the index at which two given `SudokuState`s, the current and the last, differ by exactly one item.
Again I’ll add the docstrings since they incorporate some good practices that are worth mentioning:
/// Finds the index at which two given [`SudokuState`]s
/// differ by exactly one item.
///
/// This function iterates over both arrays in lockstep and checks for a
/// pair of elements that are not equal.
/// It assumes that there is exactly one such pair and returns its index.
///
/// ## Parameters
///
/// * `previous: &SudokuState` - A reference to the first [`SudokuState`] to compare.
/// * `current: &SudokuState` - A reference to the second [`SudokuState`] to compare.
///
/// ## Returns
///
/// Returns `Some(u8)` with the index of the differing element if found,
/// otherwise returns `None` if the arrays are identical (which should not
/// happen given the problem constraints).
///
/// ## Panics
///
/// The function will panic if it cannot convert any of the Sudoku board's cell
/// indices from `usize` into a `u8`.
///
/// ## Examples
///
/// ```
/// let old_board: SudokuState = [0; 81];
/// let mut new_board: SudokuState = [0; 81];
/// new_board[42] = 1; // Introduce a change
///
/// let index = find_changed_cell(&old_board, &new_board);
/// assert_eq!(index, Some(42));
/// ```
pub fn find_changed_cell(previous: &SudokuState, current: &SudokuState) -> Option<u8> {
    for (index, (&cell1, &cell2)) in previous.iter().zip(current.iter()).enumerate() {
        if cell1 != cell2 {
            return Some(u8::try_from(index).expect("cannot convert index into u8"));
        }
    }
    None // Return None if no change is found (which should not happen in our case)
}
The function `find_changed_cell` can panic if it cannot convert any of the Sudoku board’s cell indices from `usize` into a `u8`. Hence, we add a `## Panics` section to the docstring to inform the user of this possibility. Additionally, we add an `## Examples` section to show how to use the function.
These are good practices that are worth mentioning^{3} and I highly encourage you to use them in your Rust code.
Another advantage of using Rust is that you can write tests for your code without needing a third-party library. Testing is baked into the language and you can run your tests with `cargo test`.
Here’s an example of a test for the `get_conflicting_cells` function:
#[test]
fn test_conflicts_multiple() {
    let board = [
        1, 0, 0, 0, 0, 0, 0, 0, 1, // Row 1 with conflict
        0, 1, 0, 0, 0, 0, 0, 0, 0, // Row 2 with conflict
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 3
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 4
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 5
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 6
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 7
        0, 0, 0, 0, 0, 0, 0, 0, 0, // Row 8
        1, 0, 0, 0, 0, 0, 0, 0, 0, // Row 9 with conflict
    ];
    assert_eq!(get_conflicting_cells(&board, 0), vec![8, 10, 72]);
}
And also two tests for the `find_changed_cell` function:
#[test]
fn test_find_changed_cell_single_difference() {
    let old_board: SudokuState = [0; 81];
    let mut new_board: SudokuState = [0; 81];
    new_board[42] = 1; // Introduce a change
    assert_eq!(find_changed_cell(&old_board, &new_board), Some(42));
}

#[test]
fn test_find_changed_cell_no_difference() {
    let old_board: SudokuState = [0; 81];
    // This should return None since there is no difference
    assert_eq!(find_changed_cell(&old_board, &old_board), None);
}
I had a lot of fun building this game. I gave my mother an amazing gift that she’ll treasure forever. Her smartphone has one less piece of spyware now. I deployed a fullstack web app with Rust that is fast, safe, and efficient; with the caveat that I didn’t touch any “YavaScript” or complex build tools.
I hope you enjoyed this post and that you’ll give Rust a try in your next fullstack project.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
According to Wikipedia, Sudoku is a logic-based, combinatorial number-placement puzzle. The objective is to fill a 9×9 grid with digits so that each column, each row, and each of the nine 3×3 subgrids that compose the grid contain all of the digits from 1 to 9. ↩︎
in my case I am sending the bill to Bill Gates, since the app is hosted on GitHub Pages. ↩︎
The `clippy` linter can warn you if you don’t add these sections to your docstrings. Just add `pedantic = "deny"` inside your `Cargo.toml` file in the `[lints.clippy]` section and you’re good to go. ↩︎
Warning: This post has `mermaid.js` enabled, so if you want to view the rendered diagrams, you’ll unfortunately have to enable JavaScript.
I love to learn new things and I’m passionate about Stoic philosophy. So, when I acquired the domain `stoicquotes.io`^{1}, I decided to give `htmx` a try.
`htmx`?
`htmx` is a small JavaScript library that allows you to enhance your HTML with attributes to perform AJAX (Asynchronous JavaScript and XML) without writing JavaScript^{2}. It focuses on extending HTML by adding custom attributes that describe how to perform common dynamic web page behaviors like partial page updates, form submission, etc. `htmx` is designed to be easy to use, requiring minimal JavaScript knowledge, so that you can add interactivity^{3} to web pages with just HTML.
Let’s contrast this with the Soy stuff like the notorious React framework. React, on the other hand, is a JavaScript library for building user interfaces, primarily through a component-based architecture. It manages the creation of user interface elements, updates the UI efficiently when data changes, and helps keep your UI in sync with the state of your application. React requires a deeper knowledge of JavaScript and understanding of its principles, such as components, state, and props.
In simple terms: `htmx` enhances plain HTML by letting you add attributes for dynamic behaviors, so you can make webpages interactive with no JavaScript coding; you can think of it as boosting your HTML to do more.
Additionally, React can be slower and less performant than `htmx`.
This is due to `htmx` manipulating the actual DOM itself, while React updates objects in the Virtual DOM. Afterward, React compares the new Virtual DOM with a pre-update version and calculates the most efficient way to make these changes to the real DOM. So React has to do this whole round trip of diff’ing the Virtual DOM against the actual DOM for every fucking change.
Finally, `htmx` receives pure HTML from the server. React needs to do the JSON busboy thing: the server sends JSON, React parses the JSON into JavaScript objects, then it renders them again as HTML for the browser.
Here are some `mermaid.js` diagrams to illustrate what is going on under the hood:
A consequence of these different paradigms is that `htmx` doesn’t care about what the server sends back and will happily include it in the DOM. Hence, front-end and back-end are decoupled and less complex. Whereas in Reactland, we need tight synchronicity between front-end and back-end. If the JSON that the server sends doesn’t conform to the exact specifications of the front-end, the application ~~becomes a dumpster fire~~ breaks.
When the web was created it was based on the concept of Hypermedia. Hypermedia refers to a system of interconnected multimedia elements, which can include text, graphics, audio, video, and hyperlinks. It allows users to navigate between related pieces of content across the web or within applications, creating a non-linear way of accessing information.
HTML follows the Hypermedia protocol. HTML is the native language of browsers^{4}. That’s why all the React-like frameworks have to convert JavaScript into HTML. So it’s only natural to rely primarily on HTML to deliver content and sprinkle JavaScript sparingly when you need something that HTML cannot offer.
Unfortunately, HTML has been frozen in time. Despite all the richness of HTTP, with its diverse request methods: `GET`, `HEAD`, `POST`, `PUT`, `DELETE`, `CONNECT`, `OPTIONS`, `TRACE`, `PATCH`; HTML only has two elements that interact with the server:

- `<a>`: sends a `GET` request to fetch new data.
- `<form>`: sends a `POST` request to create new data.

That’s the main purpose of `htmx`: allowing HTML elements to leverage all the capabilities of HTTP.
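For reference, those two native interactions are all plain HTML gives you (a minimal sketch, with made-up endpoint names):

```html
<!-- A GET request: navigates to and fetches a new page -->
<a href="/quotes">See all quotes</a>

<!-- A POST request: submits the form data, then reloads the page -->
<form action="/quotes" method="post">
  <input type="text" name="quote" />
  <button type="submit">Add quote</button>
</form>
```

Everything else, `PUT`, `DELETE`, partial page updates, and so on, is out of reach without JavaScript or `htmx`.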
`htmx` in Practice
OK, enough of abstract and theoretical concepts. Let’s see how `htmx` works in practice.
First, the only thing you need to do to enable `htmx` is to insert this `<script>` tag in your HTML:
<script src="https://unpkg.com/htmx.org@{version}"></script>
where `{version}` is the desired `htmx` version that you want to use. It is around 40kb in size.
Inside the code behind `stoicquotes.io`^{1}, we have the following HTML^{5}:
<div>
  <blockquote id="quote">
    Some nice Stoic quote...
  </blockquote>
</div>
<button hx-get="/quote" hx-trigger="click" hx-target="#quote" hx-swap="outerHTML">
  New
</button>
When the user clicks (`hx-trigger`) the “New” button, `htmx` sends a `GET` request to the `/quote` endpoint (`hx-get`). Then it swaps the whole HTML (`hx-swap`) of the element that has id “quote” (`hx-target`).
This is accomplished without a single character of JavaScript. Instead we extend HTML by adding new attributes to the HTML elements:

- `hx-get`
- `hx-trigger`
- `hx-target`
- `hx-swap`
The server replies with a new `<blockquote>` element every time it gets a `GET` request on the `/quote` endpoint. This is truly amazing. We just used one line of `htmx`.
`htmx` adheres to my trifecta of amazing tools^{6}:
Here’s a breakdown of what the trifecta of amazing tools means:
Powerful: A powerful tool has the capability to handle complex, demanding tasks with relative ease. It possesses the strength, performance, and features necessary to accomplish a wide range of functions.
Expressive: An expressive tool gives users the ability to articulate complex ideas, designs, or concepts with simplicity and nuance. It provides a rich set of capabilities that allow for diverse and sophisticated forms of expression.
Concise: A concise tool allows for achieving goals with minimal effort or complexity. It focuses on efficiency and effectiveness, often through simplification and the removal of unnecessary components. It should be capable of performing tasks without requiring verbose instructions or processes.
Now compare this with React. First, we need to install React. This is not simple, but here’s a breakdown:

1. install Node.js
2. install React: `npm install react react-dom`
3. create an `index.js` file with some variant of:
import { createRoot } from 'react-dom/client';
document.body.innerHTML = '<div id="app"></div>';
const root = createRoot(document.getElementById('app'));
root.render(<h1>Hello, world</h1>);
And now here’s the code for the `Quote` component:
import React, { useState } from 'react';

const Quote = () => {
  const [quote, setQuote] = useState('Some nice Stoic quote...');

  const fetchNewQuote = async () => {
    try {
      const response = await fetch('/quote');
      const newQuote = await response.text();
      setQuote(newQuote);
    } catch (error) {
      console.error('Error fetching new quote:', error);
    }
  };

  return (
    <div>
      <blockquote id="quote">
        {quote}
      </blockquote>
      <button onClick={fetchNewQuote}>
        New
      </button>
    </div>
  );
}

export default Quote;
That’s a LOT of JavaScript code. The Soy Gods must be smiling upon you, my friend.
I highly recommend that you check out `htmx`, especially the free Hypermedia Systems book, which goes into detail and is way more comprehensive than this short blog post.
`htmx` is a fresh and elegant approach to building simple reactive web pages. It extends HTML to be able to use all the capabilities of any JavaScript-based reactive framework without a single drop of JavaScript. You just add some new attributes to your HTML elements.
I’ve had such joy using `htmx` lately. It took me back to my early teens, when I was doing HTML pages on GeoCities. Good times, no JavaScript-bloated code.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
you can find the source code at storopoli/stoic-quotes. ↩︎ ↩︎
YES, yes, no YavaScript. Hooray! ↩︎
`htmx` can do much more, such as lazy loading, infinite scroll, or submitting forms without a full page reload, etc. ↩︎
I’ve simplified it a bit, removing some styling for the purpose of clarity. ↩︎
there are some other tools that I use that adhere to the trifecta. Most notable are Julia and Rust. ↩︎
I have an open access and open source^{1} graduate-level course on Bayesian statistics. It is available on GitHub through the repo storopoli/Bayesian-Statistics. I’ve taught it many times and every time was such a joy.
It is composed of:
Now and then I receive emails from someone saying that the materials helped them to understand Bayesian statistics. These kind messages really make my day, and that’s why I strive to keep the content up-to-date and relevant.
I decided to make the repository fully reproducible and testable in CI^{5} using Nix and GitHub actions.
Here’s what I am testing on every new change to the main repository and every new pull request (PR):
All of these tests demand a highly reproducible and intricate development environment. That’s where Nix comes in. Nix can be viewed as a package manager, operating system, build tool, immutable system, and many other things.
Nix is purely functional. Everything is described as an expression/function, taking some inputs and producing deterministic outputs. This guarantees reproducible results and makes caching everything easy. Nix expressions are lazy. Anything described in Nix code will only be executed if some other expression needs its results. This is very powerful but somewhat unnatural for developers not familiar with functional programming.
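To make the purity and laziness concrete, here’s a tiny sketch (not from the course repository) that you could put in a file and evaluate with `nix-instantiate --eval`:

```nix
# laziness.nix
# `boom` is declared but never referenced by the final expression,
# so it is never forced and evaluation succeeds, yielding 4.
let
  boom = throw "this is never evaluated";
  x = 1 + 1;
in
x * 2
```

The same inputs always produce the same output, and anything not demanded, like `boom`, is simply never computed.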
I enjoy Nix so much that I use it as the operating system and package manager on all of my computers. Feel free to check my setup at storopoli/flakes.
The main essence of the repository setup is the `flake.nix` file. A Flake is a collection of recipes (Nix derivations) that the repository provides.
From the NixOS Wiki article on Flakes:
Flakes is a feature of managing Nix packages to simplify usability and improve reproducibility of Nix installations. Flakes manages dependencies between Nix expressions, which are the primary protocols for specifying packages. Flakes implements these protocols in a consistent schema with a common set of policies for managing packages.
I use Nix Flakes not only to set up the main repository package, defined in the Flake as `packages.default`, which is the PDF build of the LaTeX slides; but also to set up the development environment, defined in the Flake as `devShells.default`, to run the latest versions of Stan and Julia/Turing.jl.
We’ll go over the Flake file in detail. But first, let me show the full Flake file:
{
  description = "A basic flake with a shell";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
  inputs.flake-utils.url = "github:numtide/flake-utils";
  inputs.pre-commit-hooks.url = "github:cachix/pre-commit-hooks.nix";

  outputs = { self, nixpkgs, flake-utils, pre-commit-hooks }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
        tex = pkgs.texlive.combine {
          inherit (pkgs.texlive) scheme-small;
          inherit (pkgs.texlive) latexmk pgf pgfplots tikzsymbols biblatex beamer;
          inherit (pkgs.texlive) silence appendixnumberbeamer fira fontaxes mwe;
          inherit (pkgs.texlive) noto csquotes babel helvetic transparent;
          inherit (pkgs.texlive) xpatch hyphenat wasysym algorithm2e listings;
          inherit (pkgs.texlive) lstbayes ulem subfigure ifoddpage relsize;
          inherit (pkgs.texlive) adjustbox media9 ocgx2 biblatex-apa wasy;
        };
        julia = pkgs.julia-bin.overrideDerivation (oldAttrs: { doInstallCheck = false; });
      in
      {
        checks = {
          pre-commit-check = pre-commit-hooks.lib.${system}.run {
            src = ./.;
            hooks = {
              typos.enable = true;
            };
          };
        };
        devShells.default = pkgs.mkShell {
          packages = with pkgs; [
            bashInteractive
            # pdfpc # FIXME: broken on darwin
            typos
            cmdstan
            julia
          ];
          shellHook = ''
            export JULIA_NUM_THREADS="auto"
            export JULIA_PROJECT="turing"
            export CMDSTAN_HOME="${pkgs.cmdstan}/opt/cmdstan"
            ${self.checks.${system}.pre-commit-check.shellHook}
          '';
        };
        packages.default = pkgs.stdenvNoCC.mkDerivation rec {
          name = "slides";
          src = self;
          buildInputs = with pkgs; [
            coreutils
            tex
            gnuplot
            biber
          ];
          phases = [ "unpackPhase" "buildPhase" "installPhase" ];
          buildPhase = ''
            export PATH="${pkgs.lib.makeBinPath buildInputs}";
            cd slides
            export HOME=$(pwd)
            latexmk -pdflatex -shell-escape slides.tex
          '';
          installPhase = ''
            mkdir -p $out
            cp slides.pdf $out/
          '';
        };
      });
}
A Flake is composed primarily of `inputs` and `outputs`. As `inputs` I have:
inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
inputs.flake-utils.url = "github:numtide/flake-utils";
inputs.pre-commit-hooks.url = "github:cachix/pre-commit-hooks.nix";
- `nixpkgs` is responsible for providing all of the packages necessary for both `packages.default` and `devShells.default`: `cmdstan`, `julia-bin`, `typos`, and a bunch of small `texlive` LaTeX packages.
- `flake-utils` is a bunch of Nix utility functions that create tons of syntactic sugar to make the Flake easily accessible on all platforms, such as macOS and Linux.
- `pre-commit-hooks` is a nice Nix utility to create git hooks that do some checking at several steps of the git workflow. The only hook that I am using is the `typos` pre-commit hook, which checks the whole commit changes for common typos and won’t let you commit successfully if you have typos: either correct or whitelist them in the `_typos.toml` file.

The `outputs` are the bulk of the Flake file: a Nix function that takes all the above as inputs and outputs a couple of things:
outputs = { self, nixpkgs, flake-utils, pre-commit-hooks }:
  flake-utils.lib.eachDefaultSystem (system: {
    checks = ...
    devShells = ...
    packages = ...
  });
- `checks`: things that are executed/built when you run `nix flake check`
- `devShells`: things that are executed/built when you run `nix develop`
- `packages`: things that are executed/built when you run `nix build`
Let’s go over each one of the outputs that the repository Flake has.
`packages` – LaTeX slides
We all know that LaTeX is a pain to get working. If it builds on my machine, it definitely won’t build on yours. This is solved effortlessly with Nix.
Take a look at the `tex` variable definition in the `let ... in` block:
let
  # ...
  tex = pkgs.texlive.combine {
    inherit (pkgs.texlive) scheme-small;
    inherit (pkgs.texlive) latexmk pgf pgfplots tikzsymbols biblatex beamer;
    inherit (pkgs.texlive) silence appendixnumberbeamer fira fontaxes mwe;
    inherit (pkgs.texlive) noto csquotes babel helvetic transparent;
    inherit (pkgs.texlive) xpatch hyphenat wasysym algorithm2e listings;
    inherit (pkgs.texlive) lstbayes ulem subfigure ifoddpage relsize;
    inherit (pkgs.texlive) adjustbox media9 ocgx2 biblatex-apa wasy;
  };
  # ...
in
`tex` is a custom instantiation of the `texlive.combine` derivation with some overrides to specify which CTAN packages you need to build the slides. We use `tex` in the `packages.default` Flake output:
packages.default = pkgs.stdenvNoCC.mkDerivation rec {
  name = "slides";
  src = self;
  buildInputs = with pkgs; [
    coreutils
    tex
    gnuplot
    biber
  ];
  phases = [ "unpackPhase" "buildPhase" "installPhase" ];
  buildPhase = ''
    export PATH="${pkgs.lib.makeBinPath buildInputs}";
    cd slides
    export HOME=$(pwd)
    latexmk -pdflatex -shell-escape slides.tex
  '';
  installPhase = ''
    mkdir -p $out
    cp slides.pdf $out/
  '';
};
Here we are declaring a Nix derivation with `stdenvNoCC.mkDerivation`; the `NoCC` part means that we don’t need C/C++ build tools. The `src` is the Flake repository itself, and I also specify the dependencies in `buildInputs`: I still need some fancy stuff to build my slides.
Finally, I specify the several `phases` of the derivation. The most important part is that I `cd` into the `slides/` directory, run `latexmk` in it, and copy the resulting PDF to the `$out` Nix special directory, which serves as the output directory for the derivation.
This is really nice because anyone with Nix installed can run:

nix build github:storopoli/Bayesian-Statistics

and bingo! You have my slides as a PDF built from the LaTeX files without having to clone or download the repository. Fully reproducible on any machine or architecture.
The next step is to configure GitHub Actions to run Nix and build the slides’ PDF file in CI. I have two workflows for that, and they are almost identical except for the last step. The first one is build-slides.yml, which, of course, builds the slides. These are the relevant parts:
name: Build Slides
runs-on: ubuntu-latest
steps:
  - name: Checkout repository
    uses: actions/checkout@v4
  - name: Install Nix
    uses: DeterminateSystems/nix-installer-action@v8
  - name: Build Slides
    run: nix build -L
  - name: Copy result out of nix store
    run: cp -v result/slides.pdf slides.pdf
  - name: Upload Artifacts
    uses: actions/upload-artifact@v3
    with:
      name: output
      path: ./slides.pdf
      if-no-files-found: error
Here we use a set of actions to:

- check out the repository
- install Nix
- build the slides with `nix build` (the `-L` flag is to have more verbose logs)
- copy the resulting PDF out of the Nix store and upload it as an artifact
flag is to have more verbose logs)The last one is the
release-slides.yml
,
which releases the slides when I publish a new tag.
It is almost the same as build-slides.yml
, thus I will only highlight the
relevant bits:
on:
  push:
    tags:
      - "*"

# ...

- name: Release
  uses: ncipollo/release-action@v1
  id: release
  with:
    artifacts: ./slides.pdf
The only change is the final step: we now use a `release-action` that automatically publishes a release with the slides’ PDF file as one of the release artifacts. This is good because, once I achieve a milestone in the slides, I can easily tag a new version and have GitHub automatically publish a new release with the resulting PDF file attached.
This is a very good workflow, both in GitHub and locally. I don’t need to install tons of gigabytes of texlive stuff to build my slides locally. I just run `nix build`. Also, if someone contributes to the slides, I don’t need to check the correctness of the LaTeX code, only the content and the output PDF artifact in the resulting CI from the PR. If it’s all good, just thank the blessed soul and merge it!
The repository has a directory called `turing/` which is a Julia project with `.jl` files and a `Project.toml` that lists the Julia dependencies and appropriate `compat` bounds. In order to test the Turing.jl models in the Julia files, I have the following things in the Nix Flake `devShell`:
let
  # ...
  julia = pkgs.julia-bin.overrideDerivation (oldAttrs: { doInstallCheck = false; });
  # ...
in
# ...
devShells.default = pkgs.mkShell {
  packages = with pkgs; [
    # ...
    julia
    # ...
  ];
  shellHook = ''
    # ...
    export JULIA_NUM_THREADS="auto"
    export JULIA_PROJECT="turing"
    # ...
  '';
};
A Nix `devShell` lets you create a development environment by adding a transparent layer on top of your standard shell environment with additional packages, hooks, and environment variables.
First, in the `let ... in` block, I define a variable called `julia` that is the `julia-bin` package with the attribute `doInstallCheck` overridden to `false`: I don’t want the Nix derivation of the `mkShell` to run all the Julia standard tests. Next, I define some environment variables in the `shellHook`, which, as the name implies, runs every time I instantiate the default `devShell` with `nix develop`.
With the Nix Flake part covered, let’s check how we wrap everything in a GitHub Actions workflow file named models.yml. Again, I will only highlight the relevant parts for the Turing.jl model testing CI job:
jobs:
  test-turing:
    name: Test Turing Models
    runs-on: ubuntu-latest
    strategy:
      matrix:
        jl-file: [
          "01-predictive_checks.jl",
          # ...
          "13-model_comparison-roaches.jl",
        ]
    steps:
      # ...
      - name: Test ${{ matrix.jl-file }}
        run: |
          nix develop -L . --command bash -c "julia -e 'using Pkg; Pkg.instantiate()'"
          nix develop -L . --command bash -c "julia turing/${{ matrix.jl-file }}"
I list all the Turing.jl model Julia files in a `matrix.jl-file` list to define variations for each job. Next, we install the latest Julia version. Finally, we run everything in parallel using the YAML string interpolation `${{ matrix.jl-file }}`. This expands the expression into `N` parallel jobs, where `N` is the `jl-file` list length.
If any of these parallel jobs error out, then the whole workflow will error.
Hence, we are always certain that the models are up-to-date with the latest Julia version in `nixpkgs`, and the latest Turing.jl dependencies.
The repository has a directory called `stan/` that holds a bunch of Stan models in `.stan` files. These models can be used with any Stan interface, such as RStan/CmdStanR, PyStan/CmdStanPy, or Stan.jl. However, I am using CmdStan, which only needs a shell environment and Stan; no additional dependencies like Python, R, or Julia.
Additionally, `nixpkgs` has a `cmdstan` package that is well-maintained and up-to-date with the latest Stan release. In order to test the Stan models, I have the following setup in the Nix Flake `devShell`:
devShells.default = pkgs.mkShell {
  packages = with pkgs; [
    # ...
    cmdstan
    # ...
  ];
  shellHook = ''
    # ...
    export CMDSTAN_HOME="${pkgs.cmdstan}/opt/cmdstan"
    # ...
  '';
};
Here I am also defining an environment variable, `CMDSTAN_HOME`, in the `shellHook`, because it is useful for local development. The Stan model testing CI job is defined in the same GitHub Actions workflow file, models.yml:
jobs:
  test-stan:
    name: Test Stan Models
    runs-on: ubuntu-latest
    strategy:
      matrix:
        stan: [
          {
            model: "01-predictive_checks-posterior",
            data: "coin_flip.data.json",
          },
          # ...
          {
            model: "13-model_comparison-zero_inflated-poisson",
            data: "roaches.data.json",
          },
        ]
    steps:
      # ...
      - name: Test ${{ matrix.stan.model }}
        run: |
          echo "Compiling: ${{ matrix.stan.model }}"
          nix develop -L . --command bash -c "stan stan/${{ matrix.stan.model }}"
          nix develop -L . --command bash -c "stan/${{ matrix.stan.model }} sample data file=stan/${{ matrix.stan.data }}"
Now I am using a YAML dictionary as the entry for every element in the `stan` YAML list, with two keys: `model` and `data`. `model` lists the Stan model file without the `.stan` extension, and `data` lists the JSON data file that the model needs to run. We use both to run parallel jobs that test all the Stan models listed in the `stan` list.
For that we use the following commands:
nix develop -L . --command bash -c "stan stan/${{ matrix.stan.model }}"
nix develop -L . --command bash -c "stan/${{ matrix.stan.model }} sample data file=stan/${{ matrix.stan.data }}"
This instantiates the `devShells.default` shell environment and uses the `stan` binary provided by the `cmdstan` Nix package to compile the model into an executable binary. Next, we run this model executable binary in `sample` mode while also providing the corresponding data file with `data file=`.
As before, if any of these parallel jobs errors out, the whole workflow will error. Hence, we are always certain that the models are up-to-date with the latest Stan/CmdStan version in `nixpkgs`.
I am quite happy with this setup. It makes it easy to run tests in CI with GitHub Actions, while also being effortless to instantiate a development environment with Nix. If I want to get a new computer up and running, I don’t need to install a bunch of packages and go over “getting started” instructions to have all the necessary dependencies.
This setup also helps onboard new contributors since it is:
Speaking of “contributors”, if you are interested in Bayesian modeling, feel free to go over the contents of the repository storopoli/Bayesian-Statistics. Contributions are most welcome. Don’t hesitate to open an issue or pull request.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
the code is MIT-licensed and the content is CreativeCommons Non-Commercial 4.0 ↩︎
I am also planning to go over the slides for every lecture in a YouTube playlist in the near future. This would make the experience complete: slides, lectures, and code. ↩︎
a probabilistic programming language and suite of MCMC samplers written in C++. It is today’s gold standard in Bayesian stats. ↩︎
is an ecosystem of Julia packages for Bayesian inference using probabilistic programming. ↩︎
CI stands for continuous integration, sometimes also known as CI/CD, continuous integration and continuous delivery. CI/CD is a wide “umbrella” term for “everything that is tested in all parts of the development cycle”, and these tests commonly take place on a cloud machine. ↩︎
Zero-cost abstractions allow you to write performant code without having to give up a single drop of convenience and expressiveness:
You want for-loops? You can have it. Generics? Yeah, why not? Data structures? Sure, keep’em coming. Async operations? You bet ya! Multi-threading? Hell yes!
To put more formally, I like this definition from StackOverflow:
Zero Cost Abstractions means adding higher-level programming concepts, like generics, collections and so on do not come with a run-time cost, only compile time cost (the code will be slower to compile). Any operation on zero-cost abstractions is as fast as you would write out matching functionality by hand using lower-level programming concepts like for loops, counters, ifs and using raw pointers.
Here’s an analogy:
Imagine that you are going to buy a car. The salesperson offers you a fancy car, praising how easy it is to drive: you don’t need to think about RPM, clutch and stick shift, parking maneuvers, fuel type, and other shenanigans. You just turn it on and drive. However, once you take a look at the car’s data sheet, you are horrified. The car is bad in every aspect except ease of use. It has dreadful fuel consumption, atrocious safety ratings, disastrous handling, and so on…
Believe me, you wouldn’t want to own that car.
Metaphors aside, that’s exactly what professional developers^{1} and whole teams choose to use every day: unacceptably inferior tools. Tools that not only lack zero-cost abstractions, but don’t even let you have non-zero-cost anything!
Let’s do some Python bashing in the meantime. I know that it’s easy to bash Python, but that’s not the point. If Python weren’t used so widely in production, I would definitely leave it alone. Don’t get me wrong, Python is the second-best language for everything^{2}.
I wish this meme was a joke, but it isn’t. A boolean is one of the simplest data types, taking only two possible values: true or false. Just grab your nearest Python REPL:
>>> from sys import getsizeof
>>> getsizeof(True)
28
The function sys.getsizeof returns the size of an object in bytes.
How the hell does Python need 28 bytes to represent something that needs at most 1 byte^{3}?
Imagine incurring a 28x penalty in memory size requirements for every boolean
that you use.
Now multiply this by every operation that your code is going to run in production
over time.
Again: unacceptable.
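To make the overhead concrete, here’s a minimal sketch (plain CPython, standard library only) comparing a list of booleans, where every element is a pointer to a PyObject, with a bytearray, where every flag costs a single raw byte:

```python
import sys

n = 1_000
as_list = [True] * n     # stores n 8-byte pointers to the single True object
as_bytes = bytearray(n)  # stores n raw bytes, one per flag

print(sys.getsizeof(as_list))   # roughly 8 * n bytes, plus the list header
print(sys.getsizeof(as_bytes))  # roughly n bytes, plus a small header
```

The exact numbers vary by CPython version and platform, but the roughly 8x gap (on top of the 28 bytes that each distinct boolean object costs) is the tax of “everything is a PyObject”.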
That’s because every object in Python, in the sense of everything that you can instantiate, i.e. everything that you can put on the left-hand side of the = assignment, is a PyObject:
All Python objects ultimately share a small number of fields at the beginning of the object’s representation in memory. These are represented by the PyObject and PyVarObject types.
Python is dynamically-typed, which means that you don’t have primitives like 8-, 16-, or 32-bit (un)signed integers and so on. Everything is a huge mess allocated on the heap that must carry not only its value, but also information about its type.
Most importantly, everything that is fast in Python is not Python-based. Take a look at the image below: I grabbed some popular Python libraries from GitHub, namely NumPy (linear algebra package) and PyTorch (deep learning package), and checked the language codebase percentage.
Surprise: they are not Python libraries. They are C/C++ codebases. Even though GitHub reports Python as the main language in these codebases^{4}, I don’t think that’s really the case, due to the nature of the Python code: all docstrings count as Python. If you have a very fast C function in your codebase that takes 50 lines of code, followed by a Python wrapper function that calls it in 10 lines of code, but with a docstring that is 50 lines long, you have a “Python”-majority codebase.
In a sense the most efficient Python programmer is a C/C++ programmer…
Here’s Julia, which is also dynamically-typed:
julia> Base.summarysize(true)
1
And, to your surprise, Julia is coded in… Julia! Check the image below for the language codebase percentage of Julia and Lux.jl^{5} (a deep learning package).
Finally, here’s Rust, which is not dynamically-, but statically-typed:
// main.rs
use std::mem;
fn main() {
println!("Size of bool: {} byte", mem::size_of::<bool>());
}
$ cargo run --release
Compiling size_of_bool v0.1.0
Finished release [optimized] target(s) in 0.00s
Running `target/release/size_of_bool`
Size of bool: 1 byte
Let’s cover two more zero-costs abstractions, both in Julia and in Rust: for-loops and enums.
A friend and Julia advocate once told me that Julia’s master plan is to secretly “make everyone aware of compilers”. A compiler is a program that translates source code from a high-level programming language to a low-level one (e.g. assembly language, object code, or machine code) to create an executable program.
Python uses CPython as its reference implementation and compiler. If you search around for why CPython/Python is so slow and inefficient, you’ll find that the usual culprits are:
I completely disagree with almost all of the above reasons, except the GIL. Python is slow because of its design decisions, more specifically the way CPython works under the hood. It was not built with performance in mind. Actually, the main objective of Python was to be a “language that would be easy to read, write, and maintain”. I salute that: Python has remained true to its main objective.
Now let’s switch to Julia:
I’ve copy-pasted all of Python’s arguments for inefficiency, except the GIL. And, contrary to Python, Julia is fast! Sometimes even faster than C^{6}. Actually, that was the goal all along since Julia’s inception. If you check the notorious Julia announcement blog post from 2012:
We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
(Did we mention it should be as fast as C?)
It mentions “speed” twice. Not only that, but also specifically says that it should match C’s speed.
Julia is fast because of its design decisions. One of the major reasons why Julia is fast is because of the choice of compiler that it uses: LLVM.
LLVM originally stood for low level virtual machine. Despite its name, LLVM has little to do with traditional virtual machines. LLVM can take intermediate representation (IR) code and compile it into machine-dependent instructions. It has support and sponsorship from a lot of big-tech corporations, such as Apple, Google, IBM, Meta, Arm, Intel, AMD, Nvidia, and so on. It is a pretty fast compiler that can do wonders in optimizing IR code to a plethora of computer architectures.
In a sense, Julia is a front-end for LLVM. It turns your easy-to-read and easy-to-write Julia code into LLVM IR code. Take this for-loop example inside a function:
function sum_10()
acc = 0
for i in 1:10
acc += i
end
return acc
end
Let’s check what Julia generates as LLVM IR code for this function.
We can do that with the @code_llvm
macro.
julia> @code_llvm debuginfo=:none sum_10()
define i64 @julia_sum_10_172() #0 {
top:
ret i64 55
}
You can’t easily fool the compiler. Julia understands that the answer is 55, and the LLVM IR generated code is pretty much just “return 55 as a 64-bit integer”.
Let’s also check the machine-dependent instructions with the @code_native
macro.
I am using an Apple Silicon machine, so these instructions might differ from yours:
julia> @code_native debuginfo=:none sum_10()
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl _julia_sum_10_214 ; -- Begin function julia_sum_10_214
.p2align 2
_julia_sum_10_214: ; @julia_sum_10_214
.cfi_startproc
; %bb.0: ; %top
mov w0, #55
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
The only important instruction for our argument here is the mov w0, #55. This means “move the value 55 into the w0 register”, where w0 is one of the registers available in ARM-based architectures (which Apple Silicon chips are).
This is a zero-cost abstraction! I don’t need to give up for-loops because they might be slow and inefficient, as some Python users suggest to newcomers. I can have the full convenience and expressiveness of for-loops without paying performance costs. Pretty much the definition of a zero-cost abstraction from above.
Using LLVM as a compiler backend is not something unique to Julia. Rust also uses LLVM under the hood. Take for example this simple Rust code:
// main.rs
pub fn sum_10() -> i32 {
let mut acc = 0;
for i in 1..=10 {
acc += i
}
acc
}
fn main() {
println!("sum_10: {}", sum_10());
}
We can inspect both LLVM IR code and machine instructions with the
cargo-show-asm
crate:
$ cargo asm --llvm "sum_10::main" | grep 55
Finished release [optimized] target(s) in 0.00s
store i32 55, ptr %_9, align 4
$ cargo asm "sum_10::main" | grep 55
Finished release [optimized] target(s) in 0.00s
mov w8, #55
No coincidence that the LLVM IR code is very similar; the difference is that integers default to 64 bits in Julia and 32 bits in Rust. However, the machine code is identical: “move the value 55 into a w-something register”.
Another zero-cost abstraction, in Julia and Rust, is enums.
In Julia, all enums by default have a BaseType of Int32: a signed 32-bit integer. However, we can override this with type annotations:
julia> @enum Thing::Bool One Two
julia> Base.summarysize(Thing(false))
1
Here we have an enum Thing with two variants: One and Two. Since we can safely represent the whole variant space of Thing with a boolean type, we override the BaseType of Thing to be the Bool type. Unsurprisingly, any object of Thing occupies 1 byte in memory.
We can achieve the same with Rust:
// main.rs
use std::mem;
#[allow(dead_code)]
enum Thing {
One,
Two,
}
fn main() {
println!("Size of Thing: {} byte", mem::size_of::<Thing>());
}
$ cargo run --release
Compiling enum_size v0.1.0
Finished release [optimized] target(s) in 0.09s
Running `target/release/enum_size`
Size of Thing: 1 byte
However, contrary to Julia, the Rust compiler automatically detects the enum’s variant space size and adjusts accordingly. So, no need for overrides.
Zero-cost abstractions are a joy to have in a programming language. They enable you, as a programmer, to just focus on what’s important: writing expressive code that is easy to read, maintain, debug, and build upon.
It is no wonder that zero-cost abstractions are a pervasive feature of two of my top-favorite languages: Julia and Rust.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
this post is somehow connected to my soydev rant. ↩︎
and that’s not a compliment. ↩︎
technically, we can represent a boolean with just one bit. However, the short answer is still one byte, because that’s the smallest addressable unit of memory. ↩︎
and modifying .gitattributes is cheating. Yes, I am talking to you, NumPy! ↩︎
if you compare runtime execution. ↩︎
Warning: This post has KaTeX enabled, so if you want to view the rendered math formulas, you’ll have to unfortunately enable JavaScript.
Dennis Lindley, one of my many heroes, was an English statistician, decision theorist, and leading advocate of Bayesian statistics. He published a pivotal book, Understanding Uncertainty, that changed my view on what uncertainty is and how to handle it in a coherent^{1} way. He is responsible for one of my favorite quotes: “Inside every non-Bayesian there is a Bayesian struggling to get out”; and one of my favorite heuristics around prior probabilities: Cromwell’s Rule^{2}. Lindley predicted in 1975 that “Bayesian methods will indeed become pervasive, enabled by the development of powerful computing facilities” (Lindley, 1975). You can find more about all of Lindley’s achievements in his obituary.
Lindley’s paradox^{3} is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution.
More formally, the paradox is as follows. We have some parameter $\theta$ that we are interested in. Then, we proceed with an experiment to test two competing hypotheses:
The paradox occurs when two conditions are met:
These results can occur at the same time when $H_0$ is very specific, $H_a$ more diffuse, and the prior distribution does not strongly favor one or the other. These conditions are pervasive across science and common in traditional null-hypothesis significance testing approaches.
This is a duel of frequentist versus Bayesian approaches, and one of the many in which the Bayesian approach emerges as the more coherent one. Let’s give an example and go over the analytical result with a ton of math, but also a computational result with Julia.
Here’s the setup for the example. In a certain city 49,581 boys and 48,870 girls have been born over a certain time period. The observed proportion of male births is thus $\frac{49,581}{98,451} \approx 0.5036$.
We assume that each birth is independent, with a certain probability $\theta$ of being a boy. Our data is a sequence of $n$ independent Bernoulli trials, i.e., $n$ independent random experiments with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted. Hence, we can safely assume that the number of male births follows a binomial distribution with parameters:
We then set up our two competing hypotheses:
This is a toy-problem and, like most toy problems, we can solve it analytically^{5} for both the frequentist and the Bayesian approaches.
The frequentist approach to testing $H_0$ is to compute a $p$-value^{4}: the probability of observing a number of male births at least as large as 49,581, assuming $H_0$ is true. Because the number of births is very large, we can use a normal approximation^{6} for the binomial-distributed number of male births. Let’s define $X$ as the total number of male births; then $X$ approximately follows a normal distribution:
$$X \sim \text{Normal}(\mu, \sigma)$$
where $\mu$ is the mean parameter, $n \theta$ in our case, and $\sigma$ is the standard deviation parameter, $\sqrt{n \theta (1 - \theta)}$. We need to calculate the probability of $X \geq 49,581$ given $\mu = n \theta = 98,451 \cdot \frac{1}{2} = 49,225.5$ and $\sigma = \sqrt{n \theta (1 - \theta)} = \sqrt{98,451 \cdot \frac{1}{2} \cdot \left(1 - \frac{1}{2}\right)} = \sqrt{24,612.75}$:
$$P(X \ge 49,581 \mid \mu = 49,225.5, \sigma = \sqrt{24,612.75})$$
This is basically one minus the cumulative distribution function (CDF) of $X$ at $49,581$, i.e. the integral of the normal density over $[49,581, \infty)$:
$$\int_{49,581}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} dx$$
After inserting the values and doing some arithmetic, our answer is approximately $0.0117$. Note that this is a one-sided test; since the distribution is symmetrical, the two-sided $p$-value is $0.0117 \cdot 2 \approx 0.0234$. Since we don’t deviate from Fisher’s canon, this is well below the 5% threshold. Hooray! We rejected the null hypothesis! Quick! Grab a frequentist celebratory cigar! But, wait. Let’s check the Bayesian approach.
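We can check this arithmetic with a few lines of plain Python, using only the standard library (the normal upper-tail probability via math.erfc):

```python
import math

n = 98_451  # total births
x = 49_581  # observed male births
mu = n / 2                        # mean under H0: 49,225.5
sigma = math.sqrt(n * 0.5 * 0.5)  # sqrt(24,612.75)

z = (x - mu) / sigma
# one-sided tail probability P(X >= x) under the normal approximation
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(round(p_one_sided, 4))      # ~0.0117
print(round(2 * p_one_sided, 4))  # two-sided
```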
For the Bayesian approach, we need to set prior probabilities on both hypotheses. Since we do not favor one from another, let’s set equal prior probabilities:
$$P(H_0) = P(H_a) = \frac{1}{2}$$
Additionally, all parameters of interest need a prior distribution. So, let’s put a prior distribution on $\theta$. We could be fancy here, but let’s not. We’ll use a uniform distribution on $[0, 1]$.
We have everything we need to compute the posterior probability of $H_0$ given the data $x$, the observed number of male births. For this, we’ll use Bayes theorem^{7}:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
Now let’s plug in our hypotheses and data:
$$P(H_0 \mid x) = \frac{P(x \mid H_0) P(H_0)}{P(x)}$$
Note that by the axioms of probability and the law of total probability we can decompose $P(x)$ into:
$$P(x) = P(x \mid H_0) P(H_0) + P(x \mid H_a) P(H_a)$$
Again, we’ll use the normal approximation, with $x = 49,581$, $\mu = 49,225.5$, and $\sigma = \sqrt{24,612.75}$:
$$ \begin{aligned} P(H_0 \mid x) &= \frac{ \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}} \cdot \frac{1}{2} } { \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}} \cdot \frac{1}{2} + \left( \int_0^1 \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - n \theta)^2}{2 \sigma^2}} \, d\theta \right) \cdot \frac{1}{2} } \\ &\approx 0.9505 \end{aligned} $$
The marginal likelihood of the alternative hypothesis, $P(x \mid H_a)$, is the likelihood integrated over all possible values of $\theta \ne 0.5$ under the uniform prior. Hence:
$$P(H_0 \mid \text{data}) \approx 0.9505 > 0.95$$
And we fail to reject the null hypothesis, in frequentist terms. However, we can also say, in Bayesian terms, that we strongly favor $H_0$ over $H_a$.
Quick! Grab the Bayesian celebratory cigar! The null is back in the game!
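The Bayesian side can also be computed exactly, with no normal approximation, in a few lines of plain Python. With a Uniform(0, 1) prior, the marginal likelihood of a binomial count under $H_a$ integrates to exactly $\frac{1}{n+1}$ (a Beta function identity), while the likelihood under $H_0$ is the binomial pmf at $\theta = 0.5$, computed on the log scale with math.lgamma to avoid underflow:

```python
import math

n = 98_451  # total births
x = 49_581  # observed male births

# log binomial pmf at theta = 0.5: log C(n, x) + n * log(1/2)
log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
lik_h0 = math.exp(log_choose + n * math.log(0.5))

# marginal likelihood under Ha: the binomial pmf integrated over a
# Uniform(0, 1) prior on theta, which equals 1 / (n + 1) exactly
lik_ha = 1 / (n + 1)

# equal prior probabilities P(H0) = P(Ha) = 1/2 cancel out
posterior_h0 = lik_h0 / (lik_h0 + lik_ha)
print(round(posterior_h0, 4))  # ~0.95
```

The exact result agrees with the normal-approximation derivation above to the third decimal place.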
For the computational solution, we’ll use Julia and the following packages:
We can perform a BinomialTest with HypothesisTests.jl:
julia> using HypothesisTests
julia> BinomialTest(49_225, 98_451, 0.5036)
Binomial test
-------------
Population details:
parameter of interest: Probability of success
value under h_0: 0.5036
point estimate: 0.499995
95% confidence interval: (0.4969, 0.5031)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0239
Details:
number of observations: 98451
number of successes: 49225
This is the two-sided test, and I had to round $49,225.5$ to $49,225$ since BinomialTest does not support real numbers. But the results match the analytical solution: we still reject the null.
Now, for the Bayesian computational approach, I’m going to use a generative modeling approach with one of my favorite probabilistic programming languages, Turing.jl:
julia> using Turing
julia> @model function birth_rate()
θ ~ Uniform(0, 1)
total_births = 98_451
male_births ~ Binomial(total_births, θ)
end;
julia> model = birth_rate() | (; male_births = 49_225);
julia> chain = sample(model, NUTS(1_000, 0.8), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×13×4 Array{Float64, 3}):
Iterations = 1001:1:2000
Number of chains = 4
Samples per chain = 1000
Wall duration = 0.2 seconds
Compute duration = 0.19 seconds
parameters = θ
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
θ 0.4999 0.0016 0.0000 1422.2028 2198.1987 1.0057 7368.9267
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
θ 0.4969 0.4988 0.4999 0.5011 0.5031
We can see from the quantiles output that the 95% quantile interval for $\theta$ is $(0.4969, 0.5031)$. Although it contains $0.5$, that is not the equivalent of a hypothesis test. For that, we’ll use the highest posterior density interval (HPDI), which is defined as the narrowest interval that captures a certain posterior density threshold. In this case, we’ll use a threshold of 95%, i.e. an $\alpha = 0.05$:
julia> hpd(chain; alpha=0.05)
HPD
parameters lower upper
Symbol Float64 Float64
θ 0.4970 0.5031
We see that we fail to reject the null, $\theta = 0.5$, at $\alpha = 0.05$, which is in accordance with the analytical solution.
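The HPDI computation itself is simple enough to sketch from scratch. Here is a minimal Python sketch over a vector of posterior samples (the function name and the floor-based window size are my choices for illustration, not MCMCChains.jl’s actual implementation): it slides a fixed-size window over the sorted samples and keeps the narrowest one.

```python
def hpd_interval(samples, alpha=0.05):
    # narrowest interval that contains (1 - alpha) of the samples
    sorted_samples = sorted(samples)
    n = len(sorted_samples)
    window = int((1 - alpha) * n)  # number of samples inside the interval
    # slide a window of fixed size and keep the narrowest one
    widths = [
        sorted_samples[i + window - 1] - sorted_samples[i]
        for i in range(n - window + 1)
    ]
    lo = widths.index(min(widths))
    return sorted_samples[lo], sorted_samples[lo + window - 1]

# toy example: an evenly spaced grid of "samples" on [0, 1]
samples = [i / 999 for i in range(1_000)]
print(hpd_interval(samples))
```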
Why do the approaches disagree? What is going on under the hood?
The answer is disappointing^{8}. The main problem is that the frequentist approach keeps the significance level fixed regardless of sample size, whereas the Bayesian approach is consistent and robust to sample size variations.
Taken to the extreme, in some cases, due to huge sample sizes, the $p$-value is pretty much a proxy for sample size and has little to no utility for hypothesis testing. This is known as $p$-hacking^{9}.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
Lindley, Dennis V. “The future of statistics: A Bayesian 21st century”. Advances in Applied Probability 7 (1975): 106-115.
as far as I know there’s only one coherent approach to uncertainty, and it is the Bayesian approach. Otherwise, as de Finetti and Ramsey proposed, you are susceptible to a Dutch book. This is a topic for another blog post… ↩︎
Cromwell’s rule states that the use of prior probabilities of 1 (“the event will definitely occur”) or 0 (“the event will definitely not occur”) should be avoided, except when applied to statements that are logically true or false. Hence, anything that is not a math theorem should have priors in $(0,1)$. The reference comes from Oliver Cromwell, asking, very politely, for the Church of Scotland to consider that their prior probability might be wrong. This footnote also deserves a whole blog post… ↩︎
Stigler’s law of eponymy states that no scientific discovery is named after its original discoverer. The paradox was already discussed in Harold Jeffreys' 1939 textbook. Also, fun fact, Stigler is not the original creator of such law… Now that’s a self-referential paradox, and a broad version of the Halting problem, which should earn its own footnote. Nevertheless, we are getting into the self-referential danger zone here with footnotes of footnotes of footnotes… ↩︎
this is called $p$-value and can be easily defined as “the probability of sampling data from a target population given that $H_0$ is true as the number of sampling procedures $\to \infty$”. Yes, it is not that intuitive, and it deserves not a blog post, but a full curriculum to hammer it home. ↩︎ ↩︎
that is not true for most real-world problems. For Bayesian approaches, we need to run computational asymptotically exact approximations using a class of methods called Markov chain Monte Carlo (MCMC). Furthermore, for some nasty problems we need to use a different set of methods called variational inference (VI) or approximate Bayesian computation (ABC). ↩︎
if you are curious about how this approximation works, check the backup slides of my open access and open source graduate course on Bayesian statistics. ↩︎
Bayes’ theorem is officially called Bayes-Price-Laplace theorem. Bayes was trying to disprove David Hume’s argument that miracles did not exist (How dare he?). He used the probabilistic approach of trying to quantify the probability of a parameter (god exists) given data (miracles happened). He died without publishing any of his ideas. His wife probably freaked out when she saw the huge pile of notes that he had and called his buddy Richard Price to figure out what to do with it. Price struck gold and immediately noticed the relevance of Bayes’ findings. He read it aloud at the Royal Society. Later, Pierre-Simon Laplace, unbeknownst to the work of Bayes, used the same probabilistic approach to perform statistical inference using France’s first census data in the early-Napoleonic era. Somehow we had the answer to statistical inference back then, and we had to rediscover everything again in the late-20th century… ↩︎
disappointing because most of published scientific studies suffer from this flaw. ↩︎
and, like all footnotes here, it deserves its own blog post… ↩︎
Warning: This post has KaTeX enabled, so if you want to view the rendered math formulas, you’ll have to unfortunately enable JavaScript.
I wish I could go back in time and tell my younger self that you can make a machine understand human language with trigonometry. That would definitely have made me more aware and interested in the subject during my school years. I would have looked at triangles, circles, sines, cosines, and tangents in a whole different way. Alas, better late than never.
In this post, we’ll learn how to represent words using word embeddings, and how to use basic trigonometry to play around with them. Of course, we’ll use Julia.
Word embeddings is a way to represent words as a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
Ok, let’s unwrap the above definition. First, a real-valued vector is any vector whose elements belong to the real numbers. Generally we denote vectors with a bold lower-case letter, and we denote their elements (also called components) using square brackets. Hence, a vector $\bold{v}$ that has 3 elements, $1$, $2$, and $3$, can be written as
$$\bold{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
Next, what does “close” mean for vectors? We can use distance functions to get a measurable value. The most famous and commonly used distance function is the Euclidean distance, in honor of Euclid, the “father of geometry” and the guy pictured in the image at the top of this post. The Euclidean distance is defined in trigonometry for 2-D and 3-D spaces. However, it can be generalized to any dimension $n > 1$ by using vectors.
Since every word is represented by an $n$-dimensional vector, we can use distances to compute a metric that represents similarity between vectors. And, more interestingly, we can add and subtract words (or any other linear combination of one or more words) to generate new words.
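These distance functions are simple enough to write from scratch. Here is a minimal Python sketch of the $n$-dimensional Euclidean distance and of the cosine distance (the one used in the Julia examples below, via Distances.jl):

```python
import math

def euclidean_dist(u, v):
    # generalization of the 2-D/3-D Euclidean distance to n dimensions
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_dist(u, v):
    # 1 - cos(angle between u and v): 0 for same direction, 1 for orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

print(euclidean_dist([0, 0], [3, 4]))  # 5.0
print(cosine_dist([1, 0], [0, 1]))     # 1.0
```

Note that the cosine distance only cares about the angle between the vectors, not their magnitudes, which is why it is the usual choice for comparing word embeddings.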
Before we jump to code and examples, a quick note about how word embeddings are constructed. They are trained like a regular machine learning algorithm, where the cost function measures the difference between the vector distance between two vectors and a “semantic distance” between the words they represent. The goal is to iteratively find good vector values that minimize the cost. So, if a vector is close to another vector as measured by a distance function, but far apart as measured by some semantic distance on the words that these vectors represent, then the cost function will be higher. The algorithm cannot change the semantic distance; it is treated as a fixed value. However, it can change the vector elements’ values so that the vector distance function closely resembles the semantic distance function. Lastly, the dimensionality of the vectors used in word embeddings is generally high, $n > 50$, since a proper number of dimensions is needed to represent all the semantic information of words with vectors.
Generally we don’t train our own word embeddings from scratch, we use pre-trained ones. Here is a list of some of the most popular ones:
We will use the Embeddings.jl
package to easily load word embeddings as vectors,
and the Distances.jl
package for the convenience of several distance functions.
This is a nice example of the Julia package ecosystem composability,
where one package can define types, another can define functions,
and another can define custom behavior of these functions on types that
are defined in other packages.
julia> using Embeddings
julia> using Distances
Let’s load the GloVe word embeddings. First, let’s check what we have in store to choose from GloVe’s English language embeddings:
julia> language_files(GloVe{:en})
10-element Vector{String}:
 "glove.6B/glove.6B.50d.txt"
 "glove.6B/glove.6B.100d.txt"
 "glove.6B/glove.6B.200d.txt"
 "glove.6B/glove.6B.300d.txt"
 "glove.42B.300d/glove.42B.300d.txt"
 "glove.840B.300d/glove.840B.300d.txt"
 "glove.twitter.27B/glove.twitter.27B.25d.txt"
 "glove.twitter.27B/glove.twitter.27B.50d.txt"
 "glove.twitter.27B/glove.twitter.27B.100d.txt"
 "glove.twitter.27B/glove.twitter.27B.200d.txt"
I’ll use "glove.6B/glove.6B.50d.txt". This means that it was trained with 6 billion tokens, and it provides embeddings with 50-dimensional vectors.
The load_embeddings
function takes an optional second positional
argument as an Int
to choose from which index of the language_files
to use.
Finally, I just want the words “king”, “queen”, “man”, “woman”;
so I am passing these words as a Set
to the keep_words
keyword argument:
julia> const glove = load_embeddings(GloVe{:en}, 1; keep_words=Set(["king", "queen", "man", "woman"]));
Embeddings.EmbeddingTable{Matrix{Float32}, Vector{String}}(Float32[-0.094386 0.50451 -0.18153 0.37854; 0.43007 0.68607 0.64827 1.8233; … ; 0.53135 -0.64426 0.48764 0.0092753; -0.11725 -0.51042 -0.10467 -0.60284], ["man", "king", "woman", "queen"])
Watch out for the order in which we get the words back. If you look at the output of load_embeddings, the order is: "man", "king", "woman", "queen".
Let’s see how a word is represented:
julia> queen = glove.embeddings[:, 4]
50-element Vector{Float32}:
0.37854
1.8233
-1.2648
⋮
-2.2839
0.0092753
-0.60284
They are 50-dimensional vectors of Float32
.
Now, here’s the fun part: let’s add words and check the similarity between the result and some other word. A classic example is to start with the word “king”, subtract the word “man”, add the word “woman”, and check the distance of the result to the word “queen”:
julia> man = glove.embeddings[:, 1];
julia> king = glove.embeddings[:, 2];
julia> woman = glove.embeddings[:, 3];
julia> cosine_dist(king - man + woman, queen)
0.13904202f0
This is less than 1/4 of the distance of “woman” to “king”:
julia> cosine_dist(woman, king)
0.58866215f0
Feel free to play around with other words. If you want suggestions, another classic example is:
cosine_dist(Madrid - Spain + France, Paris)
I think that allying interesting applications with abstract math topics like trigonometry is the vital missing piece in STEM education. I wish every kid learning math could have the opportunity to contemplate how new and exciting technologies have some amazingly simple math under the hood. If you liked this post, you would probably like linear algebra. I would highly recommend Gilbert Strang’s books and 3blue1brown’s series on linear algebra.
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
Let’s dive into the concept of “soydev”, a term often used pejoratively to describe developers with a superficial understanding of technology. I provide my definition of what a soydev is, why it’s bad, and how it came to be. To counteract soydev inclinations, I propose an abstract approach centered on timeless concepts, protocols, and first principles, fostering a mindset of exploration, resilience in the face of failure, and an insatiable hunger for knowledge.
While we’ll start with a look at the soydev stereotype, our journey will lead us to a wider reflection on the importance of depth in technological understanding.
First, let’s tackle the definition of soydev. Urban Dictionary provides two interesting definitions:
Urban Dictionary definition 1:
Soydev is a “programmer” that works at a bigh tech company and only knows JavaScript and HTML. They love IDEs like Visual Studio Code and inefficient frameworks that slow their code down. They represent the majority of “programmers” today and if their numbers continue growing, not one person on earth will know how a computer works by the year 2050 when all the gigachad 1980s C and Unix programmers are gone.
Urban Dictionary definition 2:
Soydev is a type of most abundant Software Developer. The Software he/she makes is always inefficient and uses more CPU and RAM than it should. This person always prefers hard work to smart work, Has little or no knowledge of existing solutions of a problem, Comes up with very complex solution for a simple problem and has fear of native and fast programming languages like C, C++ and Rust
These definitions give a glimpse of what a soydev is. However, they are loaded with pejorative language, and are also based on non-timeless technologies and tools. I much prefer to rely on concepts and principles that are timeless. Hence, I will provide my own definition of soydev:
Soydev is someone who only has a superficial conception of technology and computers that is restricted to repeating patterns learned from popular workflows on the internet; but who doesn’t dedicate time or effort to learning concepts in a deeper way.
Although soydev is a term with specific connotations, it opens the door to a larger conversation about the depth of our engagement with technology. This superficiality is not unique to soydevs but is a symptom of a broader trend in our relationship with technology.
Most of us start our journey in a skill with a superficial conception of it. However, some of us are not satisfied with this superficial conception, and strive to understand what lies beyond the surface.
Understanding concepts from first principles allows us to achieve a deep, graceful kind of mastery that seems almost effortless to others. Deep down lies a lot of effort and time spent learning and practicing: innumerable hours of deep thinking and reflecting on why things are the way they are, and how they could be different if you tried to implement them from scratch yourself.
There is also an inherently rare mixture of curiosity and creativity in the process of profoundly learning and understanding concepts in this way. You start to ask not only the “Why?” questions but also the “What if?” questions. I feel that this posture towards understanding concepts paves the way for joyful mastery.
Richard Feynman once said, “What I cannot create, I do not understand”. You cannot create anything whose underlying concepts you do not know. Feynman’s argument, allying creativity and discovery with deep knowledge, was that in order to truly master something, you need to be able to recreate it from scratch.
If you are struggling with my abstractions, I can provide some concrete examples. A soydev might be someone who:
First, let’s understand that being a soydev is not necessarily bad in itself, but a soydev is highly limited in ability and curiosity. A soydev will never be able to achieve the same level of mastery as someone who is willing to go deep and learn concepts from first principles.
Now, on the other hand, the soydev mindset is bad because it perpetuates a culture of superficiality. The path of technological innovation is guided by curiosity and creativity, and paved with hard work and deep understanding. Imagine if all the great minds in technology had taken the easy path of mindless tooling and problem-solving. We would be in a stagnant and infertile scenario, where everyone would use the same technology and tools without questioning or thinking about the problems that they are trying to solve.
Hence, the culture of soydev is bad for the future of technology, where most new developers will be highly limited in their ability to innovate.
I think that soydev culture is highly correlated with the rise of technology and the fall of barriers to accessing it. We live in an age in which not only is technology everywhere, but interacting with it is also quite effortless.
My computational statistician mind is always aware of cognitive and statistical biases. Whenever I see a correlation across time, I always take a step back and try to think about the assumptions and conceptual models behind it.
Does the increase in technology’s usage and importance in daily life result in more people using technology from a professional point of view? Yes. Does the increase in people professionally using technology result in an increase in tooling and conceptual abstractions that allow superficial interactions without the need to deeply understand the concepts behind such technology? I think that this is true as well.
These assumptions cover the constituents of the rise of soydev from a “demand” viewpoint. Nevertheless, there is also the analogous “supply” viewpoint. If these trends in demand are not met by trends in supply, we would not see the establishment of the soydev phenomenon. There is an emerging trend to standardize all the available tech into commodities.
While the commoditization of technological solutions has inherent advantages, such as scalability and lower opportunity costs, it also has disadvantages. The main one is an abrupt decrease in technological innovation. If we have strong standards that are enforced by market and social forces, then why bother innovating? Why bring new solutions or new ways to solve problems if they will not be adopted and are doomed to oblivion? Why try to do things differently when there is such a high maintenance cost, especially in training and expanding human resources capable of dealing with such non-standard solutions?
In this context, technological innovation can only be undertaken by big corporations that not only have big budgets, but also big enough influence to push their innovations as industry standards.
Don’t get me wrong: I do think that industry standards are important. However, I much prefer protocol standards to product standards. First, protocol standards are generally not tied to a single company or brand. Second, protocol standards have a higher propensity to expose their underlying concepts to developers. Think about TCP/IP versus your favorite front-end framework: which one would result in a deeper understanding of the underlying concepts?
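To make the TCP/IP comparison concrete, here is a minimal sketch using nothing but Python’s standard-library sockets on the loopback interface. The hand-written HTTP-style exchange is purely illustrative (a toy, not a full HTTP implementation), but it shows the kind of byte-level detail that web frameworks abstract away:

```python
import socket
import threading

def serve_once(server_sock: socket.socket) -> None:
    """Accept one connection, read the request, reply, and close."""
    conn, _ = server_sock.accept()
    with conn:
        # A single recv is enough here: the client's tiny request
        # fits in one TCP segment on loopback.
        request = conn.recv(1024)
        if request.startswith(b"GET /"):
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello")
        else:
            conn.sendall(b"HTTP/1.0 400 Bad Request\r\n\r\n")

# Bind to an ephemeral port on localhost so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

thread = threading.Thread(target=serve_once, args=(server,))
thread.start()

# Client side: no libraries, just raw bytes over a TCP stream.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"GET / HTTP/1.0\r\n\r\n")
    chunks = []
    # Read until the server closes the connection (recv returns b"").
    while data := client.recv(4096):
        chunks.append(data)

reply = b"".join(chunks)
thread.join()
server.close()
print(reply.decode())  # status line, blank line, body
```

Working at this level forces you to confront streams, framing, and connection lifecycles, precisely the concepts a product-level abstraction hides from you.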
The rise of soydevs mirrors a societal shift towards immediate gratification and away from the pursuit of deep knowledge.
Despite these seemingly unstoppable trends, I do think that it is possible to use tools and shallow abstractions without being a soydev, or to stop being a soydev and advance towards a deep understanding of what constitutes your craft. Moving beyond the soydev mindset is about embracing the richness that comes from a deep understanding of technology. Here is a short, by no means exhaustive, list of things that you can start doing:
This post is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
I don’t have social media, since I think it is overrated and “they sell your data”. If you want to contact me, please send an email.