Mate1's activity feed: Cassandra, Kafka, Netty, Varnish

Posted by Hisham Thu, 08 Dec 2011 04:25:00 GMT

At Mate1 we’ve just released our activity feed feature to part of our user base. The activity feed contains events based on people’s interactions with you (think you’re hot, like you, send you a message, etc.) and other events that happen in the system that we think interest you. We push these events from the web servers and other parts of the system (automatic image and message review systems, message generators, customer support, etc.) as soon as they occur into Kafka. We have a 3 node Kafka and Zookeeper cluster that holds most of the data for 2 weeks (some specialized topics are kept only for a few days). At the other end of the Kafka brokers we have several consumers that run in Tomcat. These consumers constantly pull the data from Kafka and run some business logic on it. They might decide to drop these events or further save them into Cassandra. The current Cassandra cluster is maintained over 4 nodes. In Cassandra every user has a single row for each tier of their activity feed. Events that make it into the activity feed are grouped into tiers, 4 of them, 1 being the most important events and 4 being the least. A fifth and final row stores the rolled up activity feed. This groups events by type and by user to create entries like “A, B, C, and D viewed your profile” or “A viewed your profile and liked you”. Roll-ups are done either on demand when a user logs in or periodically in the background by a roll-up daemon. Cassandra is wrapped and hidden behind an Netty Http application that accepts requests and hands back JSON objects representing the corresponding parts of user’s activity feed that was requested. We can get tiers, do paging, and mark items as read through this interface. We also use Cassandra to maintain all item (read, unread, per tier, etc.) counters. The web servers then load user’s activity feeds through this Http interface which is in fact a load balanced set up behind Varnish. If we experience high loads we can always enable caching in Varnish and avoid hitting Cassandra as much. In the future we could eliminate the web server as the middle man and directly fetch the feeds using JavaScript from the client (moving most of the work out into the browser). Its also worth mentioning that the Kafka consumers also produce data back into a Kakfa topic for logging and analytics purposes. These topics are consumed and we then push them, transformed / ETL’ed, into MySQL / data stores for analytics purposes.