Who is Fake Jeremy Liew? June 27, 2008. Posted by jeremyliew in blogs.
I got an email today from Kyle Brady titled "Response to Your Response on My Blog", which puzzled me as I hadn't come across his blog before. I clicked through to his post; he had been called by a recruiter about a job at my portfolio company RockYou, turned it down, and explained, in some detail and with great enthusiasm, why.
Someone claiming to be Jeremy Liew posted a couple of comments, parts of which I’ll excerpt below:
Kyle: I think you made a poor decision that you will one day regret deeply.
I have been a venture capitalist for many years (I got my start in the early days of venture capital back in 2006) and in all these years have seen few companies with as much potential as RockYou…
We’re still not sure what the final business model for turning those widget views into cash will be, but the opportunities are endless. I came up with 3 this morning alone…
I guess you’d have been one of those people who had the opportunity to work at a great startup like Google, Webvan, Amazon or Pets.com before they went public and would be regretting that decision to this day…
You’re young so I will give you some advice: the sooner you get in on these things, the better. You get more stock options, a better strike price and you get to be a part of a revolution at the same time…
Sleep on it tonight and tomorrow give that recruiter a call back. Tell him Jeremy Liew referred you.
You’ll thank me for it some day soon. I guarantee it.
Lightspeed Venture Partners
“Investing at the Speed of Light”
*** Sent from my iPhone 3G ***
The comments linked to the Lightspeed website, and even had the right email address (as evidenced by the fact that Kyle emailed me a very polite response, despite the nature of the comments). I’m no Steve Jobs, so I was pretty surprised to see that there is a fake Jeremy Liew out there!
IMVU selling over $1m/mth in virtual goods June 25, 2008. Posted by jeremyliew in business models, games, games 2.0, gaming, mmorpg, virtual goods, virtual worlds.
GigaOm says that IMVU is generating over $1m/mth in revenue:
Flying under the proverbial radar for the last four years, the web-based virtual world chatroom IMVU has released new jaw-breaking data: Since April 2004, it has amassed 20 million registered accounts, with 600,000 of those active monthly users. By comparison, Second Life took five years to acquire about 550,000 active users.
The company, well known to web surfers because of its ubiquitous ads, is now earning $1 million a month in revenue, 90 percent of that from the sale of virtual currency and 10 percent from banner ads embedded in its interface, CEO Cary Rosenzweig said. That works out to about $1.66 a month per active user.
This is within the range of monthly ARPU for MMOGs. The article notes that Peak Concurrent Users (PCU) is around 70k, so ARPU based on PCU is >$14/mth.
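The arithmetic behind those two ARPU figures is just revenue divided by the relevant user count. A quick sketch, using the ~$1m/mth, 600k monthly actives and ~70k PCU figures from the GigaOm piece (the rounding is mine):

```java
public class ImvuArpu {
    public static void main(String[] args) {
        double monthlyRevenue = 1_000_000; // ~$1m/mth per GigaOm
        double monthlyActives = 600_000;   // active users per month
        double peakConcurrent = 70_000;    // approximate PCU

        // ~$1.67/mth per active user, matching the quoted $1.66
        System.out.printf("ARPU (monthly actives): $%.2f/mth%n", monthlyRevenue / monthlyActives);
        // ~$14.29/mth on a PCU basis
        System.out.printf("ARPU (PCU basis): $%.2f/mth%n", monthlyRevenue / peakConcurrent);
    }
}
```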
IMVU sells credits to its users, who use those credits to buy items from the catalog to personalize their avatars. The vast majority of the items in the catalog are user generated. Of the 20m registered users, 100k are registered to make items for the catalog, but only a fraction of these are active. Still, they have created 1.7m items between them, so the active "item makers" are averaging 20+ items created each.
Howard Marks spoke at Paris GDC today, and Worlds in Motion has a writeup, including quotes from Marks about key stats for Acclaim’s free to play games:
For example, profitability is often measured through average revenue per user (ARPU). “Most of the time ARPU is $30-$40 a month,” said Marks. “A month! Not just one time.”
“The next thing is percentage of uniques,” he continued, defining “uniques” as players who spend money on a given game. “We’ve found in Asia, in Korea, we’ve found that 10% of people will spend money! I think it’s great! In the United States it’s less, more like 5-10%. But we’re getting there.”
The average lifetime for a player in the free-to-play space is 3-4 months per game, less than what is generally expected for a more traditional subscription MMO.
That statistic leads to churn rate, which describes player loss per month. “It turns out you lose a lot,” admitted Marks. “You should be prepared to say, ‘I only brought in 100,000 players this month, but only 10,000 stayed.’ That’s okay! That’s okay. Some of them will come back, and you can always get more.”
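Putting Marks' numbers together gives a rough sense of revenue per acquired player. The sketch below is my own back-of-the-envelope reading of the quotes above, not Acclaim's math, and it assumes the quoted $30-$40/mth ARPU is measured across paying players only:

```java
public class FreeToPlayLtv {
    public static void main(String[] args) {
        // Figures taken from Marks' quotes above (Western end of the ranges);
        // treating the quoted ARPU as per *paying* player is an assumption.
        double payingArpuPerMonth = 30.0; // "$30-$40 a month"
        double payingShare = 0.05;        // ~5-10% of US players spend money
        double lifetimeMonths = 3.0;      // 3-4 month average player lifetime

        // Rough lifetime revenue per acquired player
        double revenuePerPlayer = payingArpuPerMonth * payingShare * lifetimeMonths;
        System.out.printf("~$%.2f per acquired player%n", revenuePerPlayer); // ~$4.50
    }
}
```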
He also says (according to Worlds In Motion) that Acclaim will make $30m in revenues this year and break even on its core games business, but is investing heavily in R&D. Those are great numbers, and certainly place Acclaim at the front of the pack for free to play publishers in the West.
The elements of Crafting as a game mechanic June 23, 2008. Posted by jeremyliew in game design, game mechanics, games, games 2.0, gaming, virtual goods.
Psychochild posted on crafting from a game designer's point of view last week.
Crafting is typically broken down into the following steps:
1. Learn a recipe.
2. Collect resources.
3. Create the item.
4. Sell (or use) the item.
5. Longer term: Advance your skill
He notes the game design goals of each step:
Learn a recipe: Create achievements, appeal to socializers
Collect resources: Create demand for resources that give goals to players
Create the item: Provide inflows into the game economy
Sell (or use) the item: Create economy game
Advance your skill: Give players incentive to advance
For game designers, it is worth reading the whole thing.
Going viral without going down June 20, 2008. Posted by jeremyliew in database, flixster, scalability.
As the social web evolves and platforms like Facebook and MySpace open up to applications, many companies and developers are rushing to get distribution to their millions of users by “going viral”. For the successful applications, this can often present a problem (a high-quality one for sure) – how do you actually scale your deployment to handle that growth?
At Flixster, we’ve been riding this growth curve for 2 years now – first with our destination site itself (www.flixster.com), and subsequently on our embedded applications on Facebook and MySpace. Across our properties, we now have over 1 million users logging in each day and we are approaching our 2 billionth movie rating. Like many others, we started out with just a single virtual server in a shared hosting environment. So how did we scale to where we are today?
The Holy Grail for scaling is “pure horizontal scaling” – just add more boxes to service more users. This tends to be relatively easy at the application layer – there are a multitude of cheap and simple clustering and load balancing technologies. The data layer is typically much more difficult to scale, and is where a lot of web startups fall down. High-volume applications simply generate too much traffic for any reasonably-priced database (I’ll assume you’re probably running MySQL as we are). So what are your options?
Buy yourself some time
The overriding mantra to everything we’ve done to scale our database has been: “avoid going to disk at all costs”. Going to disk to retrieve data can be orders of magnitude worse than accessing memory. You should apply this principle at every layer of your application.
Given that, the first thing to do is to throw in a good caching layer. The easiest way to scale your database is to not access it. Caching can give you a ton of mileage, and we still spend a lot of effort optimizing our caching layers.
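To make the pattern concrete, here is a minimal cache-aside sketch. The in-process Map stands in for whatever distributed cache you actually deploy (memcached is the usual choice); the DAO method and key naming are purely illustrative, not production code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheAsideExample {
    // Stand-in for a distributed cache such as memcached; a real deployment
    // would use a cache client with TTLs and eviction, not a plain Map.
    private final Map<String, Object> cache = new ConcurrentHashMap<String, Object>();

    public Object getMovieRatings(long movieId) {
        String key = "ratings:" + movieId;           // hypothetical key scheme
        Object cached = cache.get(key);
        if (cached != null) {
            return cached;                           // cache hit: no database (and no disk) touched
        }
        Object fresh = loadRatingsFromDatabase(movieId); // cache miss: one database read
        cache.put(key, fresh);                       // populate so the next request is a hit
        return fresh;
    }

    private Object loadRatingsFromDatabase(long movieId) {
        // Placeholder for the real query against the ratings tables
        return "ratings for movie " + movieId;
    }
}
```

The hard part in practice is not the read path but invalidation: every write path that touches ratings has to update or expire the corresponding cache entries.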
If you can afford it, you can also buy a bigger box (RAM being the most important thing to upgrade). “Scaling up” like this can be effective to a point, but only buys you so much time because after all, it’s still a single database.
A replication setup can also buy you some time if you have a read-intensive workload and can afford to send some queries to a slave database. This has its problems though, the biggest of which is replication lag (slaves falling behind the master). For most web application workloads, replication is a tool much better suited to solving high-availability problems than scalability ones.
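If you do lean on replication for reads, the application has to route queries explicitly: writes go to the master, lag-tolerant reads can go to a slave. A rough sketch using plain JDBC; the connection URLs, credentials and the "lag tolerant" flag are illustrative, not an actual production setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ReadWriteRouter {
    // Illustrative hosts only
    private static final String MASTER_URL = "jdbc:mysql://db-master/appdb";
    private static final String SLAVE_URL  = "jdbc:mysql://db-slave1/appdb";

    // Writes (and any read that must see the latest data) go to the master.
    public Connection connectionForWrite() throws SQLException {
        return DriverManager.getConnection(MASTER_URL, "app", "secret");
    }

    // Reads that can tolerate replication lag are offloaded to a slave.
    public Connection connectionForRead(boolean lagTolerant) throws SQLException {
        String url = lagTolerant ? SLAVE_URL : MASTER_URL;
        return DriverManager.getConnection(url, "app", "secret");
    }
}
```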
It’s time to break up
Eventually, you’re going to have to find a way to “scale out” at your database layer. Split up your data into chunks. Put the chunks on separate databases. This strategy is often called “sharding” or more generally “data partitioning” (I use the two interchangeably). It works because it reduces the workload (and price tag) for each server. It’s not trivial, but it is very doable.
There is a lot of literature out there on the technical details and challenges of sharding (see the resources section). At Flixster, we’ve followed many of the strategies described by LiveJournal, Flickr and others. One of the critical things for any startup however is figuring out when to do things.
Our primary trigger for deciding to shard a given piece of data is the size of the “active” or “working” set. It all comes back to the principle of never going to disk. All of our database servers have 32GB of memory, which we give almost entirely to the MySQL process. We try to fit most, if not all, of our active data on a given server into that space.
The ratio of active / total data will vary tremendously by application (for us it seems to be in the 10-20% range). One way to figure out if your active data is saturating your available memory is to just look at cycles spent waiting for I/O on your server. This stat more than anything else we monitor drives our partitioning decisions.
The other thing we look at for a given table is the raw table size. If a table becomes too big (in terms of # of rows or total data volume) to administer – i.e. we can’t make schema changes easily – we partition it. There’s no magic threshold that fits all applications, but for us we typically decide to shard a table if we expect it to reach 30-40 million rows.
It’s certainly easier to start off with a fully sharded architecture, but most applications do not (we certainly didn’t). In fact, I’d say that if you are spending a lot of time figuring out partitioning strategies before you even have any users, you’re probably wasting development resources. So how do you actually rip the engine out of the car while it’s running? Piece by piece and very, very carefully…
Crawl, walk, run
There are a variety of partitioning strategies, which we’ve employed incrementally as we’ve grown. Here are some of the things we’ve done (in ascending order of difficulty).
Temporal (or active/archive) Partitioning
If you have a large table with a relatively small “hot spot”, consider putting the active data into a separate table. You will have some additional complexity managing the flow of data from the “active” table to the “archive” table, but at least you have split the problem a bit. This is the strategy we used early on for our movie ratings table, after realizing that 90% of the queries we were writing against it were looking for data from the last 30 days.
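In code, the hot/cold split just means querying the small active table by default and only paying for the big archive table when a query really needs full history. A sketch under that assumption; the table and column names and the 30-day window are illustrative, not the actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RatingsDao {
    private final Connection conn;

    public RatingsDao(Connection conn) {
        this.conn = conn;
    }

    // Recent ratings live in a small, memory-resident "active" table;
    // a batch job periodically moves older rows to the "archive" table.
    public ResultSet recentRatingsForUser(long userId) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT movie_id, score, rated_at FROM ratings_active " +
            "WHERE user_id = ? AND rated_at > NOW() - INTERVAL 30 DAY");
        ps.setLong(1, userId);
        return ps.executeQuery();
    }

    // Full-history queries are rarer, and are the only ones that pay the
    // cost of hitting the large archive table.
    public ResultSet allRatingsForUser(long userId) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT movie_id, score, rated_at FROM ratings_archive WHERE user_id = ? " +
            "UNION ALL " +
            "SELECT movie_id, score, rated_at FROM ratings_active WHERE user_id = ?");
        ps.setLong(1, userId);
        ps.setLong(2, userId);
        return ps.executeQuery();
    }
}
```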
Vertical (or feature-based) Partitioning
Your application may have features that are relatively independent. If so, you can put each feature on a separate database. Since the features are independent, separating them shouldn’t violate too many assumptions in your application.
We did this pretty early on, and have had a lot of success with this approach. For example, movie ratings are a core feature that didn’t overlap too much (data-wise) with the rest of the database. Comments are another one. We’ve followed the same strategy for several other “features” and now have six separate feature databases.
This was a major step forward for us as it split our big problems into several smaller ones. You might not need to go any further…vertical partitioning may be sufficient. But, then again, you want to grow forever, right?
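The routing logic for feature-based partitioning can stay very simple: each feature's data access layer is pinned to its own database (in practice, its own connection pool). A minimal sketch; the feature names and hosts are illustrative, not the actual layout:

```java
import java.util.HashMap;
import java.util.Map;

public class FeatureDatabaseRegistry {
    // One connection URL (really, one connection pool) per feature database.
    private final Map<String, String> featureToUrl = new HashMap<String, String>();

    public FeatureDatabaseRegistry() {
        featureToUrl.put("ratings",  "jdbc:mysql://db-ratings/appdb");
        featureToUrl.put("comments", "jdbc:mysql://db-comments/appdb");
        featureToUrl.put("core",     "jdbc:mysql://db-core/appdb");
    }

    // Which database to use is decided purely by the feature being coded,
    // which is why vertical partitioning is comparatively easy to retrofit.
    public String urlForFeature(String feature) {
        String url = featureToUrl.get(feature);
        if (url == null) {
            throw new IllegalArgumentException("Unknown feature: " + feature);
        }
        return url;
    }
}
```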
Horizontal (or user-based) Partitioning
Our success on Facebook drastically increased the load on our feature databases. Even our dedicated ratings database was struggling to keep up. A few months after our Facebook application launch, we deployed our first horizontal partition, separating different users’ ratings onto different physical databases.
One of the challenges of horizontal partitioning is in rewriting your data access code to figure out which database to use. With vertical partitions it’s relatively straightforward – which feature am I coding? With user-based partitioning, the logic can get much more complex. Another challenge in horizontal partitioning is the transition from your single data source into your partitions. The data migration can be painful. Extra hardware eases much of the pain, especially coupled with replication.
Following movie ratings, we have now horizontally partitioned a handful of other tables. We’ve also doubled the size of the partition cluster itself, going from four to eight master-slave pairs. We still use our vertically-partitioned feature databases, but they are under much less stress given the load absorbed by the horizontal partitions. And we continue to partition our high-volume tables on an as-needed basis.
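The core of user-based routing is a function from user id to shard. The simplest scheme is a modulo over the number of shards; a directory table (user_id to shard) is a common alternative that makes re-balancing easier. A sketch of the modulo approach under that assumption; the shard count, URLs and credentials are illustrative and not necessarily how any particular site maps its users:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class UserShardRouter {
    // Eight shard masters, echoing the eight master-slave pairs described above;
    // hostnames are illustrative.
    private static final String[] SHARD_URLS = {
        "jdbc:mysql://shard0-master/appdb", "jdbc:mysql://shard1-master/appdb",
        "jdbc:mysql://shard2-master/appdb", "jdbc:mysql://shard3-master/appdb",
        "jdbc:mysql://shard4-master/appdb", "jdbc:mysql://shard5-master/appdb",
        "jdbc:mysql://shard6-master/appdb", "jdbc:mysql://shard7-master/appdb"
    };

    // Simple modulo mapping: a given user's rows always live on the same shard.
    public int shardForUser(long userId) {
        return (int) (userId % SHARD_URLS.length);
    }

    public Connection connectionForUser(long userId) throws SQLException {
        return DriverManager.getConnection(SHARD_URLS[shardForUser(userId)], "app", "secret");
    }
}
```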
Finally, some tips
• Start small, and bite things off in pieces that are manageable. Massive, several-month-long re-architectures rarely work well.
• Get some advice. We spent a good amount of time gleaning wisdom from the success of others (which they were kind enough to put online for everyone!). See the Resources section.
• Pick the best approach for your specific problems (but you have to know where your problems are – monitor EVERYTHING).
• You’ll never get there if you don’t start.
Bonus tip – come work @ Flixster!
If you’re a DBA and interested in working on these kinds of problems at a company that is already operating at scale, please send us a resume: jobs – at – flixster.com. We’re also hiring Java developers.
Hints, tips and cheats to better datamining June 17, 2008. Posted by jeremyliew in datamining.
As web based product development and game development both become more iterative, better datamining and analysis become more and more important. But the data generated by users' behavior can be almost overwhelming. How should a startup think about getting the most insight and value from its data?
Anand Rajaraman is a co-founder of Kosmix, a Lightspeed portfolio company, and also teaches a datamining class at Stanford. He knows a thing or two about the subject, and he suggests that more data usually beats better algorithms:
Different student teams in my class adopted different approaches to the [Netflix challenge] problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard!! I’m really happy for them, and they’re going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I’m often suprised [sic] that many people in the business, and even in academia, don’t realize this.
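To make the "simple algorithm plus extra data" idea concrete, here is a toy sketch in the spirit of Team B's approach: blend a plain item-average predictor with a genre signal pulled from an independent source like IMDB. It is purely illustrative; neither the blending weights nor the data structures come from the class or the Netflix entries:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlendedRatingPredictor {
    // movieId -> mean rating from the primary (Netflix-style) data set
    private final Map<Long, Double> movieAverage = new HashMap<Long, Double>();
    // movieId -> genres, pulled from an independent source such as IMDB
    private final Map<Long, List<String>> movieGenres = new HashMap<Long, List<String>>();
    // "userId:genre" -> how that user rates that genre on average
    private final Map<String, Double> userGenreAverage = new HashMap<String, Double>();

    // Very simple blend: mostly the movie's own average, nudged by how this
    // user tends to rate the movie's genres. The 0.7/0.3 weights are arbitrary.
    public double predict(long userId, long movieId) {
        double base = movieAverage.getOrDefault(movieId, 3.0);
        List<String> genres = movieGenres.get(movieId);
        if (genres == null || genres.isEmpty()) {
            return base;
        }
        double genreSum = 0;
        for (String genre : genres) {
            genreSum += userGenreAverage.getOrDefault(userId + ":" + genre, base);
        }
        double genreSignal = genreSum / genres.size();
        return 0.7 * base + 0.3 * genreSignal;
    }
}
```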
Another fine illustration of this principle comes from Google. Most people think Google’s success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, that really took search to the next level in 1998, were:
1. The recognition that hyperlinks were an important measure of popularity — a link to a webpage counts as a vote for it.
2. The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.
First generation search engines had used only the text of the web pages themselves. The addition of these two additional data sets — hyperlinks and anchortext — took Google’s search to the next level. The PageRank algorithm itself is a minor detail — any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.
In a followup post, he notes that:
1. More data is usually better than more complex algorithms because complex algorithms don’t scale as well (computationally) and
2. More independent data is better than more of the same data, but if data was originally sparse, then more of the same data can help a lot too.
Mayank Bawa of Aster Data chimes in to say that running simple analysis over complete datasets is better than running more complex algorithms over sampled datasets, for two reasons:
1. The freedom of big data allows us to bring in related datasets that provide contextual richness.
2. Simple algorithms allow us to identify small nuances by leveraging contextual richness in the data.
In other words, since human behavior is complex and some behavioral patterns are rare, working from only a sample of the data will cause some important but rare correlations to be lost in the noise.
He also points out that Google takes a similar approach to datamining.
This is good stuff.