Andrew's Blog

Easily Hostable Protocols Over Applications

Sat, 25 Apr 2026 00:00:00 -0500

Motivated by Joshua Blais’ blog post about using the internet like it’s 1999 [1] I wanted to discuss an application and a protocol with characteristics that make them bad from a user freedom perspective. I also want to discuss solutions.

Zulip

Zulip [2] is an instant messaging program that looks like a bastion of freedom. I thought it was because it is free software which ought to be the epitome of user freedom, but it is not that simple. Free software is necessary but not sufficient for appropriate user freedom.

Try to find a third-party Zulip client for GNU/Linux. You will not. This frequently happens when you use applications and not protocols. Let’s consider three alternative protocols that facilitate instant messaging:

IRC
XMPP
Matrix

Look for third-party clients for any of these and they will abound. These protocols were sensibly designed to give users the freedom to decide how they interact with their data and the data others show them.

Nothing is stopping you from creating a third-party client for Zulip. The problem is applications aren’t designed with this in mind, but protocols are. The main reason applications generally have so few clients is they frequently change their interfaces in breaking ways. Quality protocols are designed from the ground up to ensure consistent interfaces, often with backwards compatibility in mind.
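To make the point about consistent interfaces concrete, here is a minimal sketch of parsing a single line of IRC traffic. It assumes the classic message layout from RFC 1459 (optional prefix, command, parameters, optional trailing parameter) and ignores later extensions such as IRCv3 message tags. The names IrcMessage and parse_irc_line are made up for illustration.

```python
from typing import NamedTuple, Optional


class IrcMessage(NamedTuple):
    prefix: Optional[str]  # sender such as "nick!user@host", or None if absent
    command: str           # e.g. "PRIVMSG", or a numeric reply like "001"
    params: list[str]      # middle parameters plus the trailing one, if any


def parse_irc_line(line: str) -> IrcMessage:
    """Parse one line shaped like: [:prefix] COMMAND [params] [:trailing]."""
    line = line.rstrip("\r\n")
    prefix = None
    if line.startswith(":"):
        # The prefix runs from the leading colon to the first space
        prefix, line = line[1:].split(" ", 1)
    # Everything after the first " :" is one trailing parameter (may hold spaces)
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    command, params = parts[0], parts[1:]
    if sep:
        params.append(trailing)
    return IrcMessage(prefix, command, params)


print(parse_irc_line(":nick!user@host PRIVMSG #chan :hello world"))
```

Because this line format has stayed the same for decades, a parser like this keeps working, which is part of why third-party clients are cheap to write and maintain.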

AT Protocol

The AT Protocol [3] is used by the social network Bluesky. This is a common way people introduce the protocol, and it should give you pause that a protocol is frequently associated with a singular company and social network.

Choices in the design of the AT Protocol have led to it being too costly for individuals to run full-fledged servers (see [4] for details). This results in most AT Protocol users being entirely dependent on the Bluesky organization [5] (this link is worth following). Having a strong reliance on a VC backed social media company [6] is a good way to become the product when investors demand returns. This also limits freedom by having a centralized authority with the ability to censor speech [7].

An alternative to the AT Protocol is ActivityPub. Mastodon [8] uses ActivityPub to facilitate microblogging, and the software can be hosted on cheap hardware without relying on a third party. Proponents of the AT Protocol may claim the two perform different functions, and I agree, but the centralized feed functionality that the AT Protocol provides and ActivityPub does not is something I consider to be deeply problematic from a human psychology perspective. As such, while ActivityPub can’t directly replace “your” Bluesky feed, it will give you a healthier microblogging experience. With that said, I think the case for microblogging is lacking, and individuals should consider whether microblogs, both their own and those of others, are worthwhile (my answer is emphatically no).

Proposition

Host your own services for the dissemination of information, ensuring the services you host implement protocols designed for the type of information you are disseminating.

I use Git (not GitHub) for software distribution, RSS, Gemini [9], and HTTP for broadcasted textual information, email for bidirectional individual or small group communication, and XMPP or IRC for larger group communication. These are the protocols built for these use cases, and they should be used as such.

Citations

[1] - https://joshblais.com/blog/using-the-internet-like-its-1999/

[2] - https://zulip.com

[3] - https://en.wikipedia.org/wiki/AT_Protocol

[4] - https://dustycloud.org/blog/how-decentralized-is-bluesky/

[5] - https://arewedecentralizedyet.online/

[6] - https://bsky.social/about/blog/03-19-2026-series-b

[7] - https://bsky.social/about/support/community-guidelines

[8] - https://joinmastodon.org/

[9] - http://geminiprotocol.net/

Config Files Suck

Wed, 19 Nov 2025 00:00:00 -0600

Config files suck for three main reasons.

Reasons

They often require learning an idiosyncratic way of defining things that doesn’t generalize across programs
They are annoying to version and track
They limit the configurability of software

Explanation

Learning Useless Things Sucks

Config files suck because learning useless things sucks. Most config files follow their own definition format. This results in ephemeral information being learned and very quickly forgotten. Often this process repeats a few times for pieces of software that require reconfiguration to meet the expectations of the user.

The time spent understanding the DSL of the config file would likely be better spent writing code in a proper programming language. There are two primary benefits to this. First, it builds on prior experience. If you are skilled at writing C code, that prior knowledge improves your ability to understand and configure software that is configured in C. Second, what you learn while configuring is likely less ephemeral. There are only a few good programming languages, and programming in them increases your proficiency with them.

The Disconnect Between Configuration and Source Code Sucks

Consider my dotfiles [1]. There are three files in my dotfiles that their programs don’t read from the XDG_CONFIG_HOME directory: my .bashrc, my .Xresources, and my .xinitrc. I keep these in my XDG_CONFIG_HOME directory and symlink to them from their expected locations because the programs don’t respect the XDG Base Directory specification. This is annoying to set up on each system I use. A simple solution is to not use software that doesn’t respect the spec, but when Xorg and Bash don’t respect it, that becomes a bit more difficult.
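The per-system symlink chore can be scripted. Below is a minimal sketch, not the setup from my dotfiles: the layout under XDG_CONFIG_HOME (bash/bashrc, x11/Xresources, x11/xinitrc) and the name link_dotfiles are assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical layout: file kept under $XDG_CONFIG_HOME -> path the program reads
LINKS = {
    "bash/bashrc": ".bashrc",
    "x11/Xresources": ".Xresources",
    "x11/xinitrc": ".xinitrc",
}


def link_dotfiles(home: Path, config_home: Path, links: dict[str, str] = LINKS) -> list[Path]:
    """Symlink each expected location in `home` to its file under `config_home`."""
    created = []
    for source_rel, target_rel in links.items():
        source = config_home / source_rel
        target = home / target_rel
        if target.is_symlink() or target.exists():
            continue  # never overwrite an existing file or link
        target.symlink_to(source)
        created.append(target)
    return created
```

Calling link_dotfiles(Path.home(), Path.home() / ".config") would create the three links on a fresh system and do nothing on a system that already has them.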

The bigger issue is that the disconnect between the configuration of software and source code results in multiple sources of information contributing to the configuration of a software program. This is annoying because to achieve the same program configuration across systems one must ensure two different locations are synchronized instead of one.

Being Limited in What You Can Change About Programs Sucks

Config files often don’t allow for large changes to be made to programs. The changes that can be made are constrained to what is deemed reasonable by the developers of the software. This is not the same as proprietary software, where the developers actively hinder the freedom of users, but when the source code and the configuration live in the same location, you can fundamentally change the operation of the program without jumping between different locations.

The Redeeming Factor

As I see it, config files are worthwhile if users are expected not to build from source. This may be the case for many open source projects, but I personally dislike using software in this way. This is a matter of opinion, but I think one of the most valuable parts of free software is the freedom to change the source code as I see fit. Distributing software with config files sends an implicit message to users that they are expected to use the software as is, without modifying the source code.

If you don’t see a reason to build from source, it is your right to continue using config files. If you are building a project where you want users to configure, extend, and customize their computing experience to the fullest extent, consider doing away with config files as it forces users to directly manipulate your codebase, giving them more autonomy and flexibility. This may hinder mass adoption, but is that why you build software? It’s not why I do.

Citations

[1] - https://git.laack.co/dotfiles

You Don't Need Anything

Fri, 17 Oct 2025 00:00:00 -0500

Context

The words we say have an impact on how we think. A word I find problematic is the word need.

Definition

Need: a requirement, necessary duty, or obligation [1].

My Thoughts

Do You Really need Anything?

I often say this when people use an unqualified need. An unqualified need is as follows:

I need to eat food, I’m starving!

Aside from the fact that they likely aren’t starving, they don’t need food. This can be thought of in a similar way as the is-ought problem [2]. If someone says they need something, you can ask them, “Why do you need it?”. In the case of food they may say, “I need food to survive”, and to this you may say, “Why do you need to survive”, and this can continue indefinitely as needs are predicated upon something. Often need has an implicit qualification as is the case of, “I need food” implying that it is needed to survive, but you don’t need to survive. I don’t want you to die, but it is not necessary for you to live. Tying this to the definition of need, there is no requirement, duty, or obligation for someone to be alive. There are no universal requirements to do things, no one has a universal duty to do things, and no one has a universal obligation to do anything. You may need to complete a project at work to not get fired, but you don’t need to complete the project, you can just get fired, you are not obligated to not be fired.

The danger in statements like, “I need to eat food”, is they create a dependence upon something. By saying you need something you are telling yourself that without it you are incomplete. You are beholden unto this thing. This is dangerous because it leads to acts of immorality because of the perception that something must be done. I believe this is what has led to mass surveillance. People think they need to do what their boss tells them to do even when they know it’s wrong.

YOU DON’T NEED ANYTHING. You want it because you perceive the consequences of not having it are worse than having it. This is not a need. This is a want. Understand the difference.

You don’t need to live. You don’t need food. You don’t need water.

Citations

[1] - https://www.dictionary.com/browse/need

[2] - https://en.wikipedia.org/wiki/Is%E2%80%93ought_problem

Stop Collecting User Data

Sun, 12 Oct 2025 00:00:00 -0500

Problem Statement

Sending the data of people who use applications you built, by default, for any purpose that is not strictly required for the application to function is morally wrong.

Why Does This Matter

This matters because humans are trusting. Tracking unnecessary data about application usage abuses this trust because most humans implicitly assume it is not being done, and they often don’t understand what the consequences of this tracking can be [1][2]. Additionally, it is unreasonable to expect users to look through your source code, all of your settings, and your docs to understand what data is being collected. If data is being collected, it should be obvious based on the purpose of the application, and if it is not obvious that it must be collected for the application to work, this should be made explicitly clear to users in the most obvious way possible.

Counter Arguments

But it is necessary to track errors so we can fix bugs and improve UX

Yes, this is often the case. Does the Linux kernel collect logs? Yes! Do they upload them to a server for aggregation? No! This is how error logging should be done. Write your logs to a log file, but don’t automatically upload them to your servers. If a user has an issue that they would like addressed, they will let you know about it. If they don’t notice or don’t mind the issue, it’s their right to not report it. Some users may not want to deal with the hassle of uploading logs when things break, so they may prefer to have an option to automatically upload their logs. This is totally fine, but only if they are informed about what is being logged and it is an opt-in.
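As a minimal sketch of that policy, the following writes logs to a local file only and uploads them only when the user has explicitly opted in. The names make_logger and maybe_upload, and the upload callback, are hypothetical. How the opt-in is stored and where logs would be sent are left to the application.

```python
import logging
from pathlib import Path
from typing import Callable


def make_logger(log_path: Path) -> logging.Logger:
    """Return a logger that writes to a local file and nowhere else."""
    logger = logging.getLogger(f"app.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of any root handlers
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def maybe_upload(log_path: Path, opted_in: bool, upload: Callable[[str], None]) -> bool:
    """Hand the log to `upload` only if the user opted in; report whether it ran."""
    if not opted_in:
        return False  # the default: nothing leaves the machine
    upload(log_path.read_text(encoding="utf-8"))
    return True
```

The important design choice is that opted_in defaults to false wherever it is stored, so doing nothing means nothing is sent.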

But it is necessary to track usage to understand what users want

No, it isn’t. GitHub (bleh) issues exist, Discord (ick) exists, Matrix exists, email exists. There are countless ways software projects crowdsource improvements to their applications, but it should not be done using mass surveillance. I would argue it is acceptable to have an opt-in option to collect usage data, but I do wonder about the soundness of the minds of people who choose to opt in to such surveillance.

Towards a Solution

Use applications that respect your privacy. If an application you are using collects your data and is not proprietary, it is quite likely there is a fork of it that strips out the data collection, see ungoogled-chromium [3] and LibreWolf [4] as examples. If one doesn’t exist, consider making one.

If user-respecting alternatives don’t exist and the application is proprietary, consider using Wireshark [5] to see what domains the application is resolving. Once you find the data collection domains, add these domains to your /etc/hosts file or self-hosted DNS server (like a Pi-hole), and have them resolve to 0.0.0.0. This doesn’t always work because the domain that is collecting data is sometimes also used to support the core functionality of the application, but in an ideal world this should not be necessary as you shouldn’t be using proprietary software to begin with.
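Once you have a list of domains, producing the /etc/hosts lines is mechanical. Here is a small sketch. The function name hosts_entries and the example domains are placeholders, not domains any real application is known to use.

```python
def hosts_entries(domains: list[str], sink: str = "0.0.0.0") -> str:
    """Build /etc/hosts lines that resolve each domain to `sink`."""
    seen: list[str] = []
    for domain in domains:
        # Normalize case and drop a trailing root dot so duplicates collapse
        name = domain.strip().lower().rstrip(".")
        if name and name not in seen:
            seen.append(name)
    return "".join(f"{sink} {name}\n" for name in seen)


# Placeholder domains standing in for what a packet capture might reveal
print(hosts_entries(["telemetry.example.com", "Metrics.example.net."]), end="")
```

The output can be reviewed and then appended to /etc/hosts by hand, which keeps a human in the loop before name resolution changes.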

Citations

[1] - https://en.wikipedia.org/wiki/Cambridge_Analytica

[2] - https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

[3] - https://github.com/ungoogled-software/ungoogled-chromium

[4] - https://librewolf.net/

[5] - https://www.wireshark.org/download.gmi

AdNauseam, Track Me Not, and Privacy Through Obscurity

Sat, 04 Oct 2025 00:00:00 -0500

Context

AdNauseam [1] is a fork of uBlock Origin [2] that hides ads, clicks them in the background, and aggregates the clicked ads in an easy-to-view interface. The key difference between uBlock Origin and AdNauseam is that AdNauseam both hides and clicks ads.

Track Me Not [3] is a browser extension that mimics human search queries to obscure real queries in the noise.

Usability

AdNauseam is as effective as uBlock Origin at hiding advertisements. For a user, there is no trade-off in the usability of the web when compared with uBlock Origin. This is nice because more restrictive approaches to privacy, like GNU IceCat [4] and Tor [5], hinder modern web usage.

Similarly, Track Me Not has almost no impact on the usability of the web. I say almost no because the traffic it generates likely increases the probability of being shown CAPTCHAs, given that the traffic it generates is likely distributionally different than normal traffic.

Effectiveness at Improving Privacy

AdNauseam is likely worse at protecting your privacy than uBlock Origin. By clicking ads in the background, it leaves an inherent trail of where you have been. In some ways this trail exists the moment the ad network sells the ad you are shown, but by using a more obscure technology than uBlock Origin, you are more likely to be fingerprinted. Despite this, AdNauseam makes the modern web a better experience than not having an ad blocker.

Track Me Not may also hinder privacy, depending on your privacy goals. The inherent problem is it is phoning home to search engines over time, giving them information about where you are. Search providers are also likely to fingerprint you on the basis of these strange searches. The possible saving grace is that by having so much noise in your search history, it could be difficult to figure out what you are interested in. I am uncertain about the value of this though, as there are likely going to be two search distributions: Track Me Not’s and yours. This allows sophisticated search providers to ignore the synthetic requests and track your real requests while also gaining real-time IP information.

In short, I find the idea that these tools improve privacy to be dubious.

Why You Still Might Want to Use Them

The value of these tools is they are a form of active resistance against ads and tracking. While your privacy is likely hindered by them, you are sending a message. That message costs ad networks and search providers money. In the case of AdNauseam, clicks are expensive for advertisers because most people ignore ads, and Track Me Not imposes computational costs on search providers. Even so, if you click on so many ads, ad networks may catch on and stop charging as much per click. Similarly, search providers may block you or start serving CAPTCHAs, which likely impose lower computational costs on them than running a query.

My Thoughts

You probably shouldn’t be using them. While I enjoy active resistance, this is unlikely to be the right way to do it. It may mess with their knowledge about you if they are not sophisticated, but any sophisticated search provider or ad network, which I think most of them are, will easily suss out inauthentic traffic and gain more information about you as a result.

Instead of using these tools, I recommend doing the following to improve privacy:

Use Tor when possible
Avoid sites that require signing in
Don’t use social media
Use a privacy respecting browser
Use uBlock Origin
Use a Pi-hole [6] and privacy respecting DNS servers
Minimize your usage of search engines
Use a variety of privacy respecting search engines
Use local AI tools (if you must use any at all)
Use E2EE messaging when possible
Minimize the trackability of financial transactions
Don’t carry a phone and minimize your usage of it
Avoid proprietary software and software that collects data

This results in me doing the following:

Using Tor for most of my traffic
Avoiding sites that require signing in
Not using social media sites
Using LibreWolf [7] as my default non-Tor browser
Routing all non-Tor DNS requests through a self-hosted Pi-hole with additional domain filtering
Using DuckDuckGo and a variety of public SearX instances for search
Running Ollama [8] models locally
Preferring communication with PGP encrypted emails, Matrix, or Signal
Using cash or Monero when possible for transactions
Not carrying my phone with me and only using it when it is the only means of achieving a specific goal (e.g., SMS 2FA, communication with certain individuals, etc.)
Only using libre software that doesn’t collect data

Unfortunately there are sometimes exceptions to the above for the purpose of completing my work in an efficient manner, but in my personal life, I am unwilling to compromise on these things.

Citations

[1] - https://github.com/dhowe/AdNauseam

[2] - https://github.com/gorhill/uBlock

[3] - https://github.com/vtoubiana/TrackMeNot

[4] - https://www.gnu.org/software/gnuzilla/

[5] - https://www.torproject.org/about/history/

[6] - https://pi-hole.net/

[7] - https://librewolf.net/

[8] - https://ollama.com/

The Sustainability of YouTube

Sun, 28 Sep 2025 00:00:00 -0500

Context

I dislike using cloud services because they may discontinue my service [1] or they may do something stupid [2] that negatively impacts me. These concerns, along with concerns about privacy [3], have led me to keep information and content I care about away from cloud services. This does make me wonder, how many people would be distraught about the loss of their content if YouTube terminated their accounts? This is not the topic today, nor is it something I can easily answer, but it is something I wonder about and would like others to consider.

Similarly, I am skeptical of “free” services. It’s incorrect to say “if something is free, you are the product” because charity does exist, but when it comes to Google, they aren’t a charity. Their current model with YouTube is to have people upload videos to their site and show ads to some users when they watch said videos. There are also paid subscriptions, but their primary monetization comes from ads. An important point is they don’t purge content on a regular basis, except in cases of ToS violations. As such, there is a (nearly) monotonically increasing function that describes the storage requirements of YouTube. This motivates my question below.

Question

When will YouTube’s storage costs exceed their current profit if they don’t start purging old content, assuming their revenue does not increase over time?

How to Answer This Question

We need the following information to answer this question:

What is YouTube’s annual net profit?
How much data does YouTube store?
How much does data storage cost?

YouTube’s Profit

According to Alphabet’s 2025 Q2 earnings release [4], YouTube ads generated $9.769 billion in revenue for the quarter. Annualized, this is $39.076 billion, but this is only revenue, not net profit. If we assume YouTube’s operating margin matches the operating margin across Alphabet (32%), we find an approximate net profit of $12.50432 billion / year. Actual net profit could differ from this, but since we are concerned with how much data storage this could support, we don’t need to factor in how this would be taxed.
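The arithmetic behind those two figures, as a short sketch. The variable names are mine, and the 32% margin is the assumption stated above rather than a number YouTube reports.

```python
QUARTERLY_AD_REVENUE = 9.769e9  # USD, YouTube ads, Alphabet Q2 2025 release [4]
ASSUMED_MARGIN = 0.32           # assumption: YouTube's margin equals Alphabet's

# Annualize by treating every quarter as identical to Q2
annual_revenue = QUARTERLY_AD_REVENUE * 4
annual_profit = annual_revenue * ASSUMED_MARGIN

print(f"annual revenue: ${annual_revenue / 1e9:.3f} billion")
print(f"annual profit:  ${annual_profit / 1e9:.5f} billion")
```

Annualizing from a single quarter ignores seasonality, so both numbers should be read as rough estimates.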

Storage Needs

Total Videos

YouTube states on their official blog there are over 20 million videos uploaded per day [5]. While I don’t trust YouTube very much, and they don’t have many incentives to be honest on this topic, they seem more trustworthy in this context than the slop factory sites as they are, in fact, the ones who are hosting the content. As such, I will accept this metric.

Average Video Size

I wrote a Python script that uses a curated list of popular Google Trends searches over the past few decades [6] to search YouTube for recently uploaded videos. I ran this script and compiled a list of ~7.65 million YouTube videos.

Before continuing, I will list a few limitations of this approach:

YouTube likely imposes some amount of algorithmic filtering when sorting by ‘recently uploaded’
The videos in question are all public (not inclusive of private/unlisted videos)
Less popular search terms may have a different distribution of video sizes

These are the main flaws in my methodology, but any approach will be imperfect without being able to get the data directly from YouTube.

Of these 7.65 million videos, I sampled 615,222 of them and queried YouTube using yt-dlp [7] to find all video resolutions and formats YouTube will serve. It seems unlikely to me that YouTube stores each of these resolutions on their servers, but I think it is very likely that YouTube is storing the highest resolution version they are willing to serve to users.

Based on my findings, I propose a lower bound of ~396.17 MB / video, which assumes they are only storing the highest resolution version and all other versions are generated in real time via transcoding (I am confident this isn’t the case, but it provides a nice lower bound). I also propose an upper bound of ~1.44 GB / video, which assumes they are storing every resolution and format for each video they are serving.

All of the code used for this is available on my git server [8].
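For readers who do not want to dig through the repository, here is a minimal sketch of how the two per-video bounds could be derived from a list of formats. This is not the script from the repository. It assumes each format is a dict with the "height", "filesize", and "filesize_approx" keys that yt-dlp reports, it ignores that video and audio may be separate streams, and the names format_size and size_bounds are hypothetical.

```python
from typing import Optional


def format_size(fmt: dict) -> Optional[float]:
    """Exact size in bytes if reported, else the approximate size, else None."""
    return fmt.get("filesize") or fmt.get("filesize_approx")


def size_bounds(formats: list[dict]) -> tuple[float, float]:
    """Return (lower, upper) storage bounds in bytes for one video.

    lower: only the highest-resolution format is stored
    upper: every format that reports a size is stored
    """
    sized = [(f.get("height") or 0, format_size(f)) for f in formats if format_size(f)]
    if not sized:
        raise ValueError("no format reports a size")
    lower = max(sized)[1]  # tallest format; ties go to the larger file
    upper = sum(size for _, size in sized)
    return lower, upper


# Assumption: `formats` would come from something like
# yt_dlp.YoutubeDL({"quiet": True}).extract_info(url, download=False)["formats"]
```

Averaging the lower and upper values over a sample of videos gives per-video bounds of the kind quoted above.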

Annual Storage Increase

Using my findings above about video size and YouTube’s stated video upload rate, we find:

Lower bound:

7.923 PB / Day
2.89 EB / Year

Upper bound:

28.895 PB / Day
10.547 EB / Year

Note: These values may vary depending on rounding, but they should be similar to what anyone else would find.
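The conversion can be reproduced with a few lines, assuming decimal units (1 PB = 10^9 MB, 1 EB = 1000 PB) and a 365-day year. The function name growth is made up for illustration. Note that plugging in the rounded 1.44 GB gives 28.8 PB / day rather than 28.895, which is the kind of rounding difference mentioned in the note above.

```python
VIDEOS_PER_DAY = 20_000_000   # YouTube's stated upload rate [5]
LOWER_MB_PER_VIDEO = 396.17   # highest-resolution version only
UPPER_MB_PER_VIDEO = 1_440.0  # every format, using the rounded 1.44 GB

MB_PER_PB = 1e9  # decimal units
PB_PER_EB = 1e3


def growth(mb_per_video: float, videos_per_day: int = VIDEOS_PER_DAY) -> tuple[float, float]:
    """Return (PB added per day, EB added per year) for a given average size."""
    pb_per_day = mb_per_video * videos_per_day / MB_PER_PB
    eb_per_year = pb_per_day * 365 / PB_PER_EB
    return pb_per_day, eb_per_year


for label, size in (("lower", LOWER_MB_PER_VIDEO), ("upper", UPPER_MB_PER_VIDEO)):
    pb_day, eb_year = growth(size)
    print(f"{label}: {pb_day:.3f} PB / day, {eb_year:.3f} EB / year")
```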

Storage Cost by Volume

GCP currently charges $26 / month for 1 TB of standard multi-region, US based, cloud storage [9]. If we assume the same 32% profit margin as before, this would cost them ~$17.68 / TB / month or $212.16 / TB / year. I don’t know if this is high or low relative to what they actually pay. YouTube requires quick access to many of their videos, but many of their videos are likely retrieved infrequently. Additionally, it seems likely Alphabet’s cloud storage margins are higher than the average margins across the organization. These are also only US storage prices, so the cost could vary depending on the regions the data is hosted in. In any case, I think this is a fair estimate.

Answer to the Question

Given YouTube’s approximated net profit of $12.50432 billion / year and an estimated cost of $212.16 / TB / year for cloud storage, we find their profits can support an additional ~58.94 EB of data.

At the lower bound of 2.89 EB / year we find YouTube’s storage costs will surpass their current profits in ~20.39 years.

If we assume our upper bound of 10.547 EB / year we find YouTube’s storage costs will surpass their current profits in ~5.59 years.
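The full chain from list price to years can be checked with a short sketch. All inputs are the estimates from the sections above, and the names are mine. It treats the current profit as the entire budget for storing newly uploaded content.

```python
ANNUAL_PROFIT = 12.50432e9   # USD / year, estimated above
LIST_PRICE_TB_MONTH = 26.0   # USD, GCP standard multi-region storage [9]
ASSUMED_MARGIN = 0.32        # assumption: same margin as before

# Internal cost is the list price with the assumed margin removed
cost_tb_year = LIST_PRICE_TB_MONTH * (1 - ASSUMED_MARGIN) * 12

# 1 EB = 1,000,000 TB in decimal units
supportable_eb = ANNUAL_PROFIT / cost_tb_year / 1e6


def years_until_breakeven(eb_per_year: float) -> float:
    """Years until the added storage costs as much per year as the profit."""
    return supportable_eb / eb_per_year


print(f"cost: ${cost_tb_year:.2f} / TB / year")
print(f"supportable: {supportable_eb:.2f} EB")
print(f"lower bound: {years_until_breakeven(2.89):.2f} years")
print(f"upper bound: {years_until_breakeven(10.547):.2f} years")
```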

Conclusion

These are very rough bounds, especially given how difficult it is to estimate the cost per TB / year for storage of this data given their retrieval needs, but we find that in ~5.59 - ~20.39 years, YouTube will be forced to start purging old content to remain profitable at their current profit rate.

Citations

[1] - https://killedbygoogle.com/

[2] - https://arstechnica.com/gadgets/2024/05/google-cloud-accidentally-nukes-customer-account-causes-two-weeks-of-downtime/

[3] - https://www.gnu.org/proprietary/proprietary-surveillance.gmi

[4] - https://www.sec.gov/Archives/edgar/data/1652044/000165204425000056/googexhibit991q22025.htm

[5] - https://web.archive.org/web/20250911091711/https://blog.youtube/press/

[6] - https://www.kaggle.com/datasets/dhruvildave/google-trends-dataset

[7] - https://github.com/yt-dlp/yt-dlp

[8] - http://git.laack.co/blog/log.gmi

[9] - https://cloud.google.com/storage/pricing#multi-regions