Is There A Simpler Way?

The odds are that if you are reading this blog and you're an engineer, your problem scope is not as big as Google's. You are not Google. It is more impressive to come up with a simple, elegant solution than it is to explain a complicated one to a room full of bewildered onlookers.

Is There A Simpler Way?

The odds are that if you are reading this blog and you're an engineer, your problem scope is not as big as Google's. You are not Google. It is more impressive to come up with a simple, elegant solution than it is to explain a complicated one to a room full of bewildered onlookers.

Chaturbate staff encounter complex problems that may take an alternate, close approximation for a solution. And that's preferred. If there's a simpler way to implement a solution, what are the costs for doing so? Refactor and refine complex code and your collaborators will thank you. Explore your own viable solution and data structures before introducing external libraries that carry their own baggage in the form of dependencies, dependencies-of-dependencies, security vulnerabilities, memory/thread/file handle/process leaks, and memory corruption issues such as buffer overflows and buffer over-reads.

source: https://nixos.org/~eelco/pubs/phd-thesis.pdf

User Relationships

Take for example typical online social networks with inter-user relationships. These relationships form graphs in computer science that can be directed or undirected. A Facebook friendship relation is undirected, because if Alice is a friend with Bob, then Bob must be a friend with Alice too. There's no direction- they're friends with each other.  Twitter doesn't require Alice to follow Bob back, so the follow graph must have a direction. In the case of Chaturbate's follower system, the graph is directed and can have cycles: Alice can follow Bob, who can follow Charlie, who can follow Alice.

Traversing these graphs is of interest with account linking and examining suspicious behavior with token flows; however, the decision had to be made as to whether it's more important to have fast real-time lookups with minimal depth or fast traversal? User experience trumps all, so the former wins.

Storing Followers

There are many ways to represent graphs in a computer, for example with adjacency matrices, lists, and trees. Chaturbate chose to store followers in a distributed hash table keyed on the user id and with a value being a set of user ids following that user.  For the other direction, users that one is following, the same thing. There are intricacies with the hash table having buckets that are sharded and replicas that span data centers, but rather than using a graph database (slow real-time querying) or an RDBMS with relational mapping (that would invoke referential integrity triggers with high write volume) or a column store (that has range compaction slowdowns on deletion), the simple key/value store does the job and does it extremely fast without caching.  

Data Storage Complexity

It's easy to get caught up in the latest trends in frameworks and databases of the day. Technology is always evolving and in the case of persisting data, many things have changed since the advent of relational databases. The types of data Chaturbate deals with are unpredictable and mutate frequently. Furthermore, Chaturbate has a lot more data to process today than it did in 2011. Websites do not have a luxury of going down for maintenance to alter, upgrade, or install components. Applications demand response times of a few milliseconds rather than "sub-second" query times being acceptable.

There are many types to choose from too: graph databases, probabilistic databases, temporal databases, in-memory databases, or cloud databases. The great NoSQL Movement started pre-2012, followed by NewSQL. While we fully embrace technologies like Redis Cluster and Couchbase, traditional RDBMS solves the problem most of the time with some adjustments if necessary such as with partitioning or sharding.

PostgreSQL Extensions

Three Postgres Extensions for the Data Engineers under the sky,
Seven for the stakeholders in their halls of stone,
Nine for SREs doomed to die,
One for the CEO on his dark throne
In the Data Center where the Data lie.
One Postgres Extension to rule them all, One Postgres Extension to find them,
One Postgres Extension to bring them all, and in the darkness bind them,
In the Data Center where the Data lie.

PostgreSQL is great, I don't need to say what it is or what it does. What particularly makes PostgreSQL stand out of a crowd of complicated data stores is the open and welcoming stance towards data types. There are 21 subsections in Chapter 8: Data Types. Going from a simple relational database with geospatial capabilities, it is now a multi-model database by adding support for JSONB.  If you need a new data type, you can make an extension for it. An extension can unlock a new data type, a foreign data wrapper, or some new use-case.

pg_stat_statements

Topping the list of valuable pg extensions in Chaturbate is pg_stat_statements found in the contrib directory of a PostgreSQL installation. Install this extension with

CREATE EXTENSION pg_stat_statements;

This extension will start recording queries redacted of values and then saves information about how long it took and the effects of underlying reads and writes. It will identify queries that took a long time to run as well as queries that are very frequently ran:

postgres=# select * from pg_stat_statements where query like 'INSERT%' order by calls desc limit 1;

userid              | 77764
dbid                | 16402
queryid             | 675261164826881863
query               | INSERT INTO "mytable" ("x_id", "y_id", "z_id", "created_at", "ip") VALUES ($1, $2, $3, $4::timestamp, $5::inet) RETURNING "x"."id"
calls               | 37305
total_time          | 25779.5935239998
min_time            | 0.148037
max_time            | 164.832562
mean_time           | 0.691049283581291
stddev_time         | 1.60218130530351
rows                | 37305
shared_blks_hit     | 2559651
shared_blks_read    | 109706
shared_blks_dirtied | 62584
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 12075.251073
blk_write_time      | 0

TimescaleDB

TimescaleDB is an open-source time-series database packaged as a PostgreSQL extension.  Chaturbate uses Timescale to record event data for split testing. The events are automatically partitioned across time and space (by using a user_id partitioning key).  When you query a Timescale-enabled database, you will see what look like regular tables but are actually an abstraction of chunks. This abstraction is called a hypertable.

ZomboDB

ZomboDB is an open-source extension that links PostgreSQL with Elasticsearch. Chaturbate uses this extension for full text search of chat data to help with room compliance.  The beautiful thing about this extension is that it can work on existing data. If you have a text field already, you can apply this extension and then create an index through psql that will call out to your Elasticsearch cluster API to create the index on the ES cluster. You can use a Layer 7 proxy on top of the ES endpoint to have highly available access to it.

CREATE INDEX idxchatmessages_chatmessage
                     ON chatmessages_chatmessage
                  USING zombodb ((chatmessages_chatmessage.message))
                   WITH (url='http://zombodb@10.13.13.10:19201/');

The simpler design is likely more secure

Large interconnected networks expand attack surface area. Interleaved dependency trees and functions that do multiple things and return a wide range of outputs with unknown side effects increase the overall risk to users and the organization. Every feature that is added to an application adds an amount of risk.  There will be less bugs and therefore less time debugging when a program is written incrementally with unit tests validating each piece along the way.

The Chaturbate apps platform is an example of simplistic design that can be described in a few words: user uploaded javascript runs inside a JavaScript engine like V8 embedded in another application and responds to events from the Chaturbate webservers. Each app has their own execution environment V8::Context, which means that separate, unrelated JavaScript applications can run in a single instance of V8 like tabs would in a web browser.

Functions for the apps service have a single, clearly defined purpose with low cyclomatic complexity. The Context is explicitly set and you can enter and exit it any number of times; however, the execution environment imposes sandboxed execution with its own set of built-in objects and functions.

Chaturbate knows that anything uploaded from a user to the server to be executed should be by default considered untrusted code. By using the Context execution environment, restrictions are automatically imposed without having to implement extra network-limiting api calls.