The last panel I attended was moderated by Kevin Rose from Digg.  The conversation touched on topics like dedicated vs. cluster hosting, caching solutions, and development techniques.  The session was so crowded that by the time I got there a line ran out the door because the room had reached capacity.  They were enforcing a one-in, one-out rule, so after 15 minutes I was finally let in.  My bad for not showing up early.

The panelists were:

Cal Henderson, Flickr
Joe Stump, Digg
Matt Mullenweg, WordPress
Garrett Camp, StumbleUpon
Chris Lea, MediaTemple

Buy vs. Rent vs. Cluster

What is more affordable: buying servers and co-locating them, renting dedicated servers, or going for a clustered hosting solution?  MediaTemple’s Grid Hosting is an example of a clustered hosting solution.  Their promise is affordable performance: instead of running your site off one server, you take advantage of their cluster of 200+ computers.  They handle load balancing and can help keep your site from getting crushed if it ever lands on the front page of Digg or TechCrunch.  For the run-of-the-mill blog or web application this will do.  If you are looking for a customized or specialized solution, then this kind of hosting isn’t for you.  Contacting MediaTemple may work, but if your application hurts the performance of the other sites on the cluster, then say hello to buying or renting dedicated servers.

Cal from Flickr said buying servers and co-locating them was economical 3 years ago when Flickr started.  It still is, since they require so much storage space for all their photos.  For sites with smaller storage requirements that are not looking to invest in servers, Matt from WordPress said you should look at hardware as a service.  For WordPress, renting servers was the economical thing to do.  Renting means that someone else is responsible for hardware issues and uptime.  On their own, hiring someone full time to handle operations would have been impossible.

Read, Writes, Queues, and Dirty Caching

Users of web applications perform two main actions: viewing and writing.  When a user views a page of a web application, it is generated server-side and delivered as static content to the browser.  If you have 1000 page views, this page will be generated 1000 times.  Each generation means data retrieval and processing.  Now if you cache that one view, your server retrieves the data once and serves the stored copy to the 1000 or even 100,000 viewers at negligible cost.

Flickr is an example of a site where a majority of the action is users viewing photos.  The photos on Flickr can be 200 KB to 10 MB.  Retrieving these images from a database when a user views a page is expensive.  Cal mentioned that Flickr caches all of its users’ images to ensure good performance.  Flickr displays an older version of the cache to people who are not logged in; this boosts speed while still catering to logged-in users.

A big question that comes up is: How long do I cache this information for?  The answer: it depends.  At Digg they cache user data indefinitely.  How many times do you update your username?  Your email?  Not often, which is why they cache it with no expiration.  When a user updates their information, Digg invalidates the cache and lets it regenerate.
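The "cache forever, invalidate on update" pattern can be sketched like this.  This is my own illustration, not Digg's code: the `UserStore` class, its dictionaries, and the `db_reads` counter are all hypothetical stand-ins for a real database and cache.

```python
class UserStore:
    """Toy user store: reads go through a cache with no expiration."""

    def __init__(self):
        self.db = {}       # stand-in for the real database
        self.cache = {}    # stand-in for the cache layer
        self.db_reads = 0  # counts how often the "database" is hit

    def get_user(self, user_id):
        # Cache miss: read from the database once and keep the copy forever.
        if user_id not in self.cache:
            self.db_reads += 1
            self.cache[user_id] = dict(self.db[user_id])
        return self.cache[user_id]

    def update_user(self, user_id, **fields):
        # Write to the database, then drop the stale cache entry.
        # The next read regenerates it.
        self.db.setdefault(user_id, {}).update(fields)
        self.cache.pop(user_id, None)


store = UserStore()
store.update_user(1, username="kevin", email="k@example.com")
for _ in range(100):
    store.get_user(1)
print(store.db_reads)  # 1

store.update_user(1, email="new@example.com")
print(store.get_user(1)["email"])  # new@example.com
print(store.db_reads)  # 2
```

The database is only read again after a write, which fits data like usernames and emails that rarely change.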

Remember: If you do not have caching and get 10 million hits a day to your site, your database will be called 10 million times.  If you have time-sensitive data, try caching it for just one second.  There are 86,400 seconds in a day, so that means at most 86,400 calls to your db every day instead of 10 million!
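Here is a small sketch of that one-second cache, plus the arithmetic behind the 86,400 figure.  The `TTLCache` class and the fake clock are my own illustration; the simulation is scaled down to one minute at 100 hits per second so it runs instantly.

```python
SECONDS_PER_DAY = 24 * 60 * 60
print(SECONDS_PER_DAY)  # 86400


class TTLCache:
    """Caches a single value and reloads it once the TTL has passed."""

    def __init__(self, loader, ttl_seconds, clock):
        self.loader = loader      # function that hits the "database"
        self.ttl = ttl_seconds
        self.clock = clock        # injected so the example can fake time
        self.value = None
        self.expires_at = None
        self.loads = 0            # number of database calls made

    def get(self):
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at:
            self.value = self.loader()
            self.expires_at = now + self.ttl
            self.loads += 1
        return self.value


fake_now = [0.0]  # mutable fake clock, in seconds
cache = TTLCache(lambda: "front page data", ttl_seconds=1,
                 clock=lambda: fake_now[0])

hits = 0
for second in range(60):      # one simulated minute
    for hit in range(100):    # 100 page views per second
        fake_now[0] = second + hit / 100
        cache.get()
        hits += 1

print(hits)         # 6000
print(cache.loads)  # 60
```

However many hits arrive within a second, the database is asked once for that second, so the daily ceiling is the number of seconds in a day.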

When a user writes to a web application it means they are storing information in a database.  This can be updating your display name, writing a comment, or uploading a video.  While there is no way to cache a write (unless you have some magic sauce that can predict what a user is thinking), you can increase performance by queuing write actions.

Digg is a web application that must handle hundreds of writes per second when their users "digg" a website.  To scale a service like this, the architecture must be clever and queue the write actions.  People can use open source solutions like Gearman, Starling, or what the panel endearingly called a Ghetto Queue in MySQL.

A good litmus test for figuring out what can be queued is to ask yourself, "How long until a user cares or gets mad if this data is not updated?"  At Digg they queue all "diggs" in a small queue to handle the load.  When a user "diggs" a website, it updates the cache on their computer to show it has been "dug" and the action lines up in the queue.  A figure was tossed out that it gets pushed into the database in about 15 seconds.
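A queued write can be sketched in a few lines.  This is a toy, in-memory version under my own assumptions, not Digg's architecture: the `DiggQueue` class is hypothetical, and the `flush` method stands in for whatever worker drains a real queue (Gearman, Starling, or a MySQL table) every few seconds.

```python
from collections import Counter, deque


class DiggQueue:
    """Toy write queue: diggs pile up in memory and are flushed in batches."""

    def __init__(self):
        self.pending = deque()      # queued write actions
        self.db_counts = Counter()  # stand-in for the database table
        self.db_writes = 0          # number of database writes performed

    def digg(self, story_id):
        # Cheap: just line the action up in the queue.
        self.pending.append(story_id)

    def visible_count(self, story_id):
        # What the user sees: stored count plus diggs not yet flushed.
        return self.db_counts[story_id] + self.pending.count(story_id)

    def flush(self):
        # A worker would run this periodically. All queued diggs for a
        # story collapse into a single database write.
        batch = Counter(self.pending)
        self.pending.clear()
        for story_id, n in batch.items():
            self.db_counts[story_id] += n
            self.db_writes += 1
        return len(batch)


queue = DiggQueue()
for _ in range(500):
    queue.digg("story-a")
for _ in range(3):
    queue.digg("story-b")

print(queue.visible_count("story-a"))  # 500
print(queue.db_writes)                 # 0

queue.flush()
print(queue.db_counts["story-a"])      # 500
print(queue.db_writes)                 # 2
```

In this sketch 503 user actions turn into 2 database writes, and the user never notices the delay because the count they see already includes the queued diggs.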

Flip-flop of Engineering and Hardware Costs

At a startup, engineers work long hours for little or no pay; their sweat comes from passion for the project.  Buying hardware is a painful process of emptying savings accounts and racking up debt.

One of the questions from the audience was "We have 1 database partitioned over seven servers running MySQL, how do we make it perform better?  Is there any code to help us?"  One of the panelists replied: buy more hardware, it is cheaper than hacking away at code to make things more efficient.

As a company scales, engineering gets more expensive: salaries, benefits, and time spent not working on new features.  Put in perspective, hardware becomes the cheaper and more effective route to efficiency.

Partitioning for Stability

This one is a no-brainer.  One database on one server will not be able to scale and will be a nightmare if there is a hard drive failure.  Partition the database across multiple machines and enable replication!
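The panel did not go into how to partition, so treat the following as one common approach rather than what any of these sites do: pick the database server by hashing a key such as the user id.  The host names and the `shard_for` function are hypothetical, and replication is not shown.

```python
import hashlib

# Hypothetical database hosts, one per partition.
SHARDS = [
    "db0.example.internal",
    "db1.example.internal",
    "db2.example.internal",
]


def shard_for(key, shards=SHARDS):
    """Map a key (e.g. a user id) to one shard, the same one every time."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# The same user always lands on the same server.
print(shard_for("user:42") == shard_for("user:42"))  # True

# Different users spread across the available servers.
used = {shard_for(f"user:{n}") for n in range(1000)}
print(len(used))  # 3
```

One design caveat: with plain modulo hashing, changing the number of shards moves most keys to a different server, so adding machines later means migrating data.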

Bottlenecks

What causes bottlenecks?  For one, the panel’s consensus was that it is not the language of your application.  Use PHP, Python, ASP, or Ruby.  So where do they come from?  One way to find out is to imagine your site had 10 times the users and 20 times the page views.

Most likely it is going to come from your db.  Take a look there.  It usually comes down to a few tables; for Digg it is comments and diggs.

Also take a look at your lesser-used functions.  Sometimes these are admin functions.  Or sometimes it is shitty code.  High use or even a DoS attack can aggravate a bottleneck and take down your site.

Communities Don’t Scale

This was a great point by the panel.  While growing communities may increase your user base and traffic, it is not always healthy.  The best thing to do is communicate with your users and developers.  Have town halls and encourage meetups.  Give your users options when you add new features.  Allow them control of the interface and let the power users turn things off.  Also give them some control over the contributions of the community.  On Digg this is in the form of "burying" a website.  On Flickr it is marking something as inappropriate.

To encourage good growth it is also important to allow segregation.  Accept that you have your old users, your new users, and your power users.  Flickr allows users to create niches by forming groups and participating in forums.

Going International

The key to going international is caching in data centers around the globe.  Cache your CSS, JavaScript, site images, and application data.  Let the consumption experience be as speedy as possible.  Dynamic calls for viewing are what hurt.  Minimize them as much as possible.

Try to store as much information as possible locally.  Matt from WordPress noted that it is better to store data in the United States because of privacy concerns.  It is also good to have one main location for your application and cache things globally for simplicity.  WordPress uses a company called Panther for their content delivery network.  Flickr stores all its data in the US and caches it in data centers across the globe.

What about write activities?  Aren’t those slow? For most applications write activities are not the majority of calls.  If they are, a good balance between queuing and caching locally should do wonders.

Performance of APIs

Creating great APIs can be an art… or a pain.  From Digg’s perspective, an API is just a different presentation of the data.  The API uses the same infrastructure as the site, so it takes advantage of all the caching and queuing logic.

Cal from Flickr joked that APIs are a way for users to suck down all your data in one fell swoop.  Most of the time it is researchers who are trying to create visualizations using Flickr images.  At Flickr they employ throttling to control the load.  In the beginning don’t stress too much about getting throttling right; you will learn how your developer community uses the API over time.
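The panel did not say how Flickr throttles, so here is one well-known technique, a token bucket, as a sketch.  The `TokenBucket` class, its rate and capacity numbers, and the fake clock are all my own assumptions; in practice you would keep one bucket per API key.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, then `rate_per_second` on average."""

    def __init__(self, rate_per_second, capacity, clock):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock              # injected so the example can fake time
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        # Refill based on the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per request if one is available.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_now = [0.0]  # mutable fake clock, in seconds
bucket = TokenBucket(rate_per_second=1, capacity=5, clock=lambda: fake_now[0])

# A burst of 10 calls at the same instant: only the first 5 get through.
burst = [bucket.allow() for _ in range(10)]
print(burst.count(True))  # 5

# Two seconds later, two more tokens have accumulated.
fake_now[0] = 2.0
later = [bucket.allow() for _ in range(3)]
print(later)  # [True, True, False]
```

Starting with generous numbers and tightening them once you see real usage fits the panel's advice not to stress about getting throttling right on day one.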

Developer Tips

The panel had a nice mix of people who embrace development methodologies and those that do not.  Here are some random snippets of advice and practices:

  • Have 2 people own a portion of the project.  If a developer quits, gets sick, or "gets hit by a bus" you will still be able to chug along.  Not just that, it will increase the quality of code and ownership over different parts of the application.
  • Design projects in a wiki.  If you plan ahead of time and document your development as you go, good things happen.  It lets you catch coding issues early.  It helps new coders ramp up quicker.
  • If you have remote employees, great.  Make sure there is good communication and try to get them on-site once a month.  Bringing people together is encouraged.
  • The perfect team size is 3 to 6.  It is easier to come to consensus in smaller groups.  Consensus building makes things move extremely fast.  Cal from Flickr quipped that if you have a problem with building consensus, just hire people who always agree with you.
  • As you scale your team, more structure is needed.  You may hate it, but in order to survive, it is needed.
  • Be able to push rapidly to your production environment.  Digg can push daily, but they are looking to push once every two weeks.  eBay pushes 3-4 times a week.
  • Deploy your product to a smaller set of your users.  It will help you identify performance issues and catch bugs.
  • Yay for unit testing, branching, svc, etc.
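
The tip about deploying to a smaller set of users can be sketched as a percentage rollout.  Nobody on the panel described their mechanism, so this is just one common way to do it; the `in_rollout` function and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` buckets.

    Hashing the feature name together with the user id gives each user a
    stable bucket from 0 to 99, so the same user always gets the same answer.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Start by exposing a hypothetical "new-comments" feature to 10% of users.
early_users = [u for u in range(10000) if in_rollout(u, "new-comments", 10)]

# Widening to 50% keeps every early user in the group.
wider_users = {u for u in range(10000) if in_rollout(u, "new-comments", 50)}
print(set(early_users) <= wider_users)  # True
```

Because buckets are stable, you can raise the percentage step by step while watching for performance problems and bugs, and nobody flips back and forth between the old and new version.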

These are my notes and thoughts from the sessions I attended at South By Southwest 2008.  Over the next couple of weeks I will be going through my hand-written notes and summarizing them.  SXSW was a place for sharing ideas, so feel free to distribute and link to them.  See all SXSW notes here.