I previously mentioned the new cluster-based, autoscaling setup we were moving to. The initial work deployed correctly, but we found a few bottlenecks that caused serious performance issues.

Those issues are now resolved on both Loforo and our image/video host (Thumbsnap), so responses should be much faster.

For the technical folks who want specifics:

  • NFS file shares over the private network didn't perform as expected, leading to slow transfers that used only a tiny portion of our full bandwidth. This was fixed with NFS caching (via cachefilesd).

  • The database was getting hammered by background jobs (media encoding, feed generation, caching) whenever containers were scaled up and down. Jobs were rewritten to wait longer before starting instead of running at startup.

  • High rates of media access were causing a race condition where the same images/videos were being repeatedly fetched from the storage servers. This was fixed with code changes.

  • The new clustered setup gives us a massive amount of bandwidth (multiple gigabits) and also lets the service self-heal and scale automatically as traffic increases.
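
For anyone curious what "wait longer before starting" can look like for the background jobs, here is a rough sketch in Python. It is not our actual code. The function name `startup_delay`, the 60-second base, and the added random jitter are all assumptions for illustration; the idea is just that freshly scaled containers don't all hit the database at the same moment.

```python
import random
from typing import Optional


def startup_delay(base_seconds: float,
                  jitter_seconds: float,
                  rng: Optional[random.Random] = None) -> float:
    """Return how long a background job should wait before its first run.

    base_seconds is a fixed minimum wait; jitter_seconds adds a random
    extra so containers started together spread out their first run.
    (Both parameters are hypothetical, chosen for this sketch.)
    """
    if base_seconds < 0 or jitter_seconds < 0:
        raise ValueError("delays must be non-negative")
    source = rng if rng is not None else random
    return base_seconds + source.uniform(0, jitter_seconds)


if __name__ == "__main__":
    # Example: wait at least 60s, plus up to 30s of random jitter.
    delay = startup_delay(60, 30)
    print(f"first job run in {delay:.1f}s")
    # A real worker would time.sleep(delay) here before starting its jobs.
```

The random part is optional; even a fixed delay keeps jobs from running the instant a container boots.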
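
The repeated-fetch race on media files is commonly handled by coalescing concurrent requests for the same item, so only one caller goes to storage and the rest wait for its result. The sketch below shows that general technique in Python; it is an assumption about the shape of such a fix, not a description of our code changes, and the `SingleFlight` class and `fetch` callback are hypothetical names.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class SingleFlight:
    """Coalesce concurrent fetches of the same key into one call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, Future] = {}

    def do(self, key: str, fetch: Callable[[], bytes]) -> bytes:
        # The first caller for a key becomes the leader and registers a
        # Future; later callers find that Future and wait on it.
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut

        if leader:
            try:
                fut.set_result(fetch())
            except BaseException as exc:
                # Pass the failure on to every waiting caller.
                fut.set_exception(exc)
            finally:
                # Forget the key so a later request fetches fresh data.
                with self._lock:
                    self._inflight.pop(key, None)

        # Blocks until the leader finishes; re-raises its exception if any.
        return fut.result()
```

A request handler would call something like `flights.do(media_id, lambda: read_from_storage(media_id))`, where `read_from_storage` stands in for whatever actually talks to the storage servers. This only deduplicates requests that overlap in time, so it sits alongside a cache rather than replacing one.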

Let us know if you notice anything still broken.