GetGlue Engineering

Official blog for GetGlue Engineering Team

Follow us on Twitter: @GetGlueEng
Mashable Awards 2011 Breakout Startup Winner
What is GetGlue?

GetGlue's vision is to create a deeply personalized, social and connected experience around television, movies and sports.


  • Fast Range Queries with Redis

    Redis is already a pretty good choice for analytics tasks. Support for atomic counters and bitmaps makes it an attractive option. What if we could make it even better by gaining the ability to quickly answer questions of the form “given the dataset D, how many of its values fall between a and b?”

    Some contrived forms of that question go like this:

    Given our last year of unique visits data, how many uniques did we have between January 1 and June 1 in 2012? How about that night we got frontpaged on Reddit - do we have data for April 3 at 9pm to 11pm UTC? Possible to drill down to 9:15 to 9:45?

    If we were storing this data in a row-based system with the intent of querying it on an ad-hoc basis, the table would contain one row per unique visit. Without some sort of pruning or roll-up routine, the table keeps growing, and this type of one-off range query eventually becomes infeasible:

    SELECT COUNT(*) FROM uniques
      WHERE timestamp > DATETIME("2012-04-03T21:15:00Z")
      AND timestamp < DATETIME("2012-04-03T21:45:00Z")
    

    Achieving that queryability with Redis is pretty straightforward. The key is the Fenwick tree, originally devised to store cumulative frequency tables for data compression, but as it turns out it’s pretty useful for generic range queries. It will only contain as many nodes as there are possible unique values in the metric that is being measured, so in the case of a year’s worth of events at minute resolution, the maximum size would be 525,600 nodes, i.e., 1 node for each minute. The downside is that you need to know the largest possible value of your data set ahead of time. The upside is that Fenwick trees really shine when your possible range is large enough that linear scans of the data are prohibitively expensive. If a tree has a max size of 1 billion elements, a query needs about 60 node accesses at most: two cumulative reads of up to 30 nodes each, since log2 of 1 billion is roughly 30.
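
    The access counts quoted above can be checked with a little arithmetic. This is a minimal sketch, assuming a cumulative read touches at most one node per bit of the largest index; the function names are hypothetical and only illustrate the bound:

```python
def max_read_accesses(max_size):
    # A cumulative read visits one node per set bit of the index,
    # so it can never visit more nodes than the index has bits.
    return max_size.bit_length()


def max_query_accesses(max_size):
    # A range query is the difference of two cumulative reads.
    return 2 * max_read_accesses(max_size)


print(max_query_accesses(10**9))   # 60
print(max_query_accesses(525600))  # 40
```

    For the minute-resolution year used in this post, that works out to at most 40 node accesses per query.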

    We can store these structures in Redis hashes: sparse data uses less memory, and hashes give fast random access, which these structures do a lot of. It’s also possible to implement them with sorted sets, but those use a bit more memory than hashes do. To see how they work, let’s build one in Python.

    First we define a query method that takes two values and returns the number of values in the dataset that fall between them:

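    # `redis` is assumed to be a connected redis-py client.
    def query(key, a, b):
        # Difference of two cumulative frequencies.
        return read(key, max(a, b)) - read(key, min(a, b))
    
    def read(key, event):
        # Collect the nodes whose partial sums add up to the cumulative frequency.
        ixs = []
        while event > 0:
            ixs.append(event)
            event = event & (event - 1)  # clear the lowest set bit
        if not ixs:
            return 0
        # HMGET returns None for missing fields and strings for stored counters.
        return sum(int(v or 0) for v in redis.hmget(key, *ixs))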
    

    It’s implemented this way because the primitive read in this data structure actually returns the cumulative frequency of the event. The difference of the two cumulative frequencies is the result we are actually looking for. Keeping with our example, if we wanted to query from April 3 at 9:15pm UTC to 9:45pm, we would first find that 9:15pm is 135,195 minutes into the year and that 9:45pm, 30 minutes later, is 135,225 minutes in. We then just execute this query to get our result, the number of unique visitors between those two times:

    query("uniques", 135196, 135226)
    

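    Note the bump by 1: this structure doesn’t start at zero, it starts at 1, so we should offset accordingly. Also note that query subtracts the cumulative count at a, so the result covers the half-open range (a, b]: events recorded at index a itself are not counted. Pass a - 1 as the lower bound to include them.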
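
    The index arithmetic can be sketched with the standard library. The helper name minute_index is hypothetical, and the datetimes are naive values assumed to be in UTC:

```python
from datetime import datetime


def minute_index(when, year_start):
    """1-based index of the minute containing `when`, counted from `year_start`."""
    elapsed = when - year_start
    return int(elapsed.total_seconds() // 60) + 1


start = datetime(2012, 1, 1)
print(minute_index(datetime(2012, 4, 3, 21, 15), start))  # 135196
print(minute_index(datetime(2012, 4, 3, 21, 45), start))  # 135226
```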

    For all the power this gives us, Redis is not doing much work:

    HMGET uniques 135226 135224 135216 135200 135168 131072
    HMGET uniques 135196 135192 135184 135168 131072
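
    Those field lists follow directly from the loop in read. A small sketch that reproduces them without touching Redis (the helper name read_indices is hypothetical):

```python
def read_indices(event):
    """Hash fields that a cumulative read of `event` has to fetch."""
    fields = []
    while event > 0:
        fields.append(event)
        event &= event - 1  # clear the lowest set bit
    return fields


print(read_indices(135226))  # [135226, 135224, 135216, 135200, 135168, 131072]
print(read_indices(135196))  # [135196, 135192, 135184, 135168, 131072]
```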
    

    Read is a logarithmic operation and only takes a single Redis HMGET to compute. Write is logarithmic as well, but unfortunately requires more than one Redis HINCRBY. It’s optimized a little by doing it in a pipeline:

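    def write(key, event, repeat, max):
        # `redis` is assumed to be a connected redis-py client.
        if event < 1:
            # Index 0 has no lowest set bit, so the loop below would never end.
            raise ValueError("event must be >= 1")
        with redis.pipeline() as pipe:
            # Add `repeat` to every node whose partial sum covers `event`.
            while event <= max:
                pipe.hincrby(key, event, repeat)
                event += event & -event  # step up by the lowest set bit
            pipe.execute()  # send all queued HINCRBYs together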
    

    In write, event is the value being recorded, and repeat is the number of times it occurred. It lets us specify that the event happened N more times as an alternative to writing the same event N times. The max parameter is the largest possible value that the structure can contain, so continuing with our example from earlier, it would be 525,600. Another thing to remember is that the range of data is (0, 525600], so the first minute in January would actually be 1. With that, here’s how we add 11 uniques on April 3 at 9:15pm UTC:

    write("uniques", 135196, 11, 525600)
    

    Running that sends this over the wire to Redis:

    HINCRBY uniques 135196 11
    HINCRBY uniques 135200 11
    HINCRBY uniques 135232 11
    HINCRBY uniques 135296 11
    HINCRBY uniques 135424 11
    HINCRBY uniques 135680 11
    HINCRBY uniques 136192 11
    HINCRBY uniques 137216 11
    HINCRBY uniques 139264 11
    HINCRBY uniques 147456 11
    HINCRBY uniques 163840 11
    HINCRBY uniques 196608 11
    HINCRBY uniques 262144 11
    HINCRBY uniques 524288 11
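
    The list of incremented fields comes from repeatedly adding the lowest set bit until the index passes the maximum. A sketch that reproduces it offline (write_indices is a hypothetical helper name):

```python
def write_indices(event, max_value):
    """Hash fields that a write of `event` has to increment."""
    fields = []
    while event <= max_value:
        fields.append(event)
        event += event & -event  # add the lowest set bit
    return fields


fields = write_indices(135196, 525600)
print(len(fields))             # 14
print(fields[:3], fields[-1])  # [135196, 135200, 135232] 524288
```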
    

    That’s pretty much the gist of it. The important points to remember are:

    • Useful for performing range queries on datasets too large to be efficiently queried in other systems in an ad-hoc manner;
    • O(log(N)) read and write times, and read can be executed in a single Redis command;
    • O(N) storage space for a full tree;
    • Any totally-ordered dataset can be usefully stored and queried.
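
    To experiment with the structure without a Redis server, the same logic can be modeled on a plain dictionary. This is only a sketch: the class name FenwickCounter and the second write are made up for illustration, and the dictionary stands in for the Redis hash.

```python
from collections import defaultdict


class FenwickCounter:
    """In-memory stand-in for the Redis hash: field -> partial sum."""

    def __init__(self, max_value):
        self.max_value = max_value
        self.nodes = defaultdict(int)

    def write(self, event, repeat=1):
        if event < 1:
            raise ValueError("event must be >= 1")
        while event <= self.max_value:
            self.nodes[event] += repeat        # mirrors HINCRBY
            event += event & -event

    def read(self, event):
        total = 0
        while event > 0:
            total += self.nodes.get(event, 0)  # mirrors one HMGET field
            event &= event - 1
        return total

    def query(self, a, b):
        # Counts events in the half-open range (min(a, b), max(a, b)].
        return self.read(max(a, b)) - self.read(min(a, b))


tree = FenwickCounter(525600)
tree.write(135196, 11)  # 11 uniques at 9:15pm, as in the post
tree.write(135210, 4)   # hypothetical extra visits at 9:29pm
print(tree.query(135196, 135226))  # 4, index 135196 itself is excluded
print(tree.query(135195, 135226))  # 15, lower bound moved down by one
```

    The two results show why the lower bound matters: the first query skips the events recorded at index 135196, while the second includes them.
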
    • 1 month ago
  • Autocomplete Search with Redis

    When we launched GetGlue HD, we built a faster and more powerful search that helps users find the shows and movies they want to check in to as they type into the search box. To accomplish that, we used the in-memory data structures of the Redis data store to build an autocomplete search index.

    Search Goals

    The results we wanted to autocomplete for are a little different than the usual result types. The Auto complete with Redis writeup by antirez explores using the lexicographical ordering behavior of sorted sets to autocomplete for names. This is a great approach for things like usernames, where the prefix typed by the user is also the prefix of the returned results: typing mar could return Mara, Marabel, and Marceline. The deal-breaking limitation is that it will not return Teenagers From Mars, which is what we want our autocomplete to be able to do when searching for things like show and movie titles. To do that, we decided to roll our own autocomplete engine to fit our requirements.

    Building the Index

    Shows and movies in the autocomplete index have three properties:

    • key — a unique identifier for the resource in the system, like tv_shows/twin_peaks
    • title — the human-readable name of the resource, e.g. Twin Peaks
    • score — the popularity of the show on GetGlue

    Since we want to be able to autocomplete based on individual words in the title, we need to map possible users’ search terms to these properties. A combination of Redis sorted sets and hashes fits this need.

    To add an item to the index, we first want to get the possible search terms that would return that item. That means getting each prefix for each word in the title. For example, the prefixes for Twin Peaks are t, tw, twi, twin, p, pe, pea, peak, and peaks. Each prefix is the key name of a sorted set whose members are item keys, scored by the items’ popularity scores. So what happens in Redis when we add Twin Peaks to the index, with the key tv_shows/twin_peaks and score of 1000?

    ZADD t 1000 tv_shows/twin_peaks
    ZADD tw 1000 tv_shows/twin_peaks
    ZADD twi 1000 tv_shows/twin_peaks
    ZADD twin 1000 tv_shows/twin_peaks
    ZADD p 1000 tv_shows/twin_peaks
    ZADD pe 1000 tv_shows/twin_peaks
    ZADD pea 1000 tv_shows/twin_peaks
    ZADD peak 1000 tv_shows/twin_peaks
    ZADD peaks 1000 tv_shows/twin_peaks
    HSET $titles tv_shows/twin_peaks "Twin Peaks"
    

    We use the ZADD command to add keys to sorted sets, and then we maintain a hash associating a title to each key with HSET.
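
    The prefix expansion can be sketched in a few lines of Python. The function names are hypothetical, and this version simply lowercases the title and splits on whitespace; punctuation handling and other normalization are left out:

```python
def prefixes(title):
    """Every prefix of every word in the title, lowercased, in order."""
    result = []
    for word in title.lower().split():
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            if prefix not in result:
                result.append(prefix)
    return result


def index_commands(key, title, score):
    """Redis commands (as text) that would add one item to the index."""
    commands = ["ZADD %s %d %s" % (p, score, key) for p in prefixes(title)]
    commands.append('HSET $titles %s "%s"' % (key, title))
    return commands


for command in index_commands("tv_shows/twin_peaks", "Twin Peaks", 1000):
    print(command)
```

    Running it prints the same ten commands listed above.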

    Searching the Index

    Given a search term, we can use sorted set commands to retrieve the results quickly. For example, given the search term “twin”, we can issue a ZREVRANGE command to get the top N items by score with twin somewhere in their title. When run inside of redis-cli, the results for the top 5 look something like this:

    redis 127.0.0.1:6379> zrevrange twin 0 4 withscores
     1) "tv_shows/twin_peaks"
     2) "1000"
     3) "tv_shows/doctor_who_twin_dilemma"
     4) "270"
     5) "tv_shows/please_twins"
     6) "250"
     7) "tv_shows/twins"
     8) "44"
     9) "tv_shows/cramp_twins"
    10) "30"
    

    So how can we deal with multiple search terms? That’s where ZINTERSTORE comes in. This command stores the intersection of multiple sorted sets in a single set. Say the user wasn’t looking for Twin Peaks, but for Doctor Who: The Twin Dilemma, and they refine their search to “twin dil”. We can get the intersection of the keys from the two terms twin and dil with ZINTERSTORE, and then use ZREVRANGE as usual to retrieve the matches:

    redis 127.0.0.1:6379> zinterstore $tmp 2 twin dil aggregate max
    (integer) 1
    redis 127.0.0.1:6379> zrevrange $tmp 0 4 withscores
    1) "tv_shows/doctor_who_twin_dilemma"
    2) "270"
    

    This approach works for any number of search terms and also allows the terms to be issued out of order, so searching for “story never” and “never story” are the same search. After our results are selected, we can use HMGET to read the titles of all the result keys back:

    redis 127.0.0.1:6379> hmget $titles tv_shows/doctor_who_twin_dilemma
    1) "Doctor Who: The Twin Dilemma"
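
    The whole lookup can be modeled in memory to see how the pieces fit together. This is a sketch rather than the production code: plain dictionaries stand in for the sorted sets and the $titles hash, and the function name search is an assumption.

```python
def search(index, titles, query, limit=5):
    """index: prefix -> {item key: score}; titles: item key -> title."""
    terms = query.lower().split()
    if not terms:
        return []
    sets = [index.get(term, {}) for term in terms]
    # Keep only the keys present under every term, like ZINTERSTORE.
    common = set(sets[0])
    for members in sets[1:]:
        common &= set(members)
    # AGGREGATE MAX keeps the highest score seen for each key.
    scored = [(max(members[key] for members in sets), key) for key in common]
    scored.sort(reverse=True)  # highest score first, like ZREVRANGE
    return [(titles[key], score) for score, key in scored[:limit]]


index = {
    "twin": {"tv_shows/twin_peaks": 1000, "tv_shows/doctor_who_twin_dilemma": 270},
    "dil": {"tv_shows/doctor_who_twin_dilemma": 270},
}
titles = {
    "tv_shows/twin_peaks": "Twin Peaks",
    "tv_shows/doctor_who_twin_dilemma": "Doctor Who: The Twin Dilemma",
}
print(search(index, titles, "twin dil"))  # [('Doctor Who: The Twin Dilemma', 270)]
print(search(index, titles, "dil twin"))  # [('Doctor Who: The Twin Dilemma', 270)]
```

    Because the terms are intersected as a set, their order in the query does not change the result.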
    

    Refining the Search Results

    Ranking the results just by their score is an OK way to do it, but it does not always return the best results. The outlier items will nearly always show up at the top of the results even when the user is not searching for them. There are a few things that can be done with the results from Redis before surfacing them to the user:

    • Work with the logarithms of the items’ scores
    • Modify score based on how close the search query is to the actual title
    • Modify score based on searching user’s sentiment toward the item
    • Penalize unused search terms
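
    As an illustration of the first two ideas only, here is one hypothetical way to re-rank results after they come back from Redis. The weighting formula and the use of difflib for title closeness are assumptions made for this sketch, not the scoring used in production; user sentiment and unused-term penalties are not modeled.

```python
import math
from difflib import SequenceMatcher


def rerank(query, results):
    """results: list of (title, raw_score); returns titles, best first."""
    def adjusted(item):
        title, raw_score = item
        # Work with the logarithm of the score so outliers dominate less.
        popularity = math.log(raw_score + 1)
        # Similarity between the query and the title, from 0.0 to 1.0.
        closeness = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        return popularity * (0.5 + closeness)  # hypothetical weighting
    ranked = sorted(results, key=adjusted, reverse=True)
    return [title for title, _ in ranked]


print(rerank("twins", [("Twin Peaks", 100), ("Twins", 100)]))
```

    With equal popularity, the title that matches the query more closely moves to the top.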

    Conclusion

    Redis is an excellent tool and store for working with data structures over the network. The API’s tools are accessible enough to prototype the entire autocomplete index inside of the command-line interface without writing a line of code, and robust enough to support the activity of more than 3 million users searching over hundreds of thousands of objects, millions of people, and thousands of stickers in real time.

    Get the Code

    The index used at GetGlue is written in Python, backed by Redis, and exposed via HTTP with a small Flask application. It’s not yet open source, but this is a simplified version which functions very similarly to the one we use in production. This depends on redis-py to work.

    • 5 months ago
  • GetGlue’s Front-end Stack

    Earlier this year GetGlue began work on the new GetGlue website. It was a chance for a clean engineering slate, allowing us to put all our knowledge together to begin a new project on strong footing. It was a chance to use all of the most powerful client-side technologies to allow us to work efficiently and quickly.

    Our old front-end stack consisted of a PHP backend that rendered XML documents with XSLT. We used vanilla CSS to style our pages and plain JavaScript to add sugar to the UI. We relied on jQuery for DOM manipulation and various jQuery plugins to increase the usability of the site. We optimized our JavaScript files for deployment via our own concatenation and optimization script and served it as one canonical file.

    The new GetGlue website is taking advantage of many new powerful front end tools.

    The tl;dr list of tools:

    • CoffeeScript
    • SASS
    • Compass
    • jQuery
    • Backbone.js
    • Underscore.js
    • Mustache.js
    • RequireJS

    Preprocessors: CoffeeScript and SASS

    Before any code was written we made the decision to use preprocessors as part of our tool-chain. Specifically we decided to use CoffeeScript and SASS to code and style the new GetGlue website.

    CoffeeScript has enabled us to develop features at a break-neck pace. It strips away a great deal of the boilerplate code that exists in JavaScript. For example when referencing a property on a JavaScript object:

    view = response.items[0].title 

    There’d be instances when the items array would be missing (null or undefined). With JavaScript we would have to verify its existence via:

    if (response.items != null) { view = response.items[0].title; } 

    The CoffeeScript equivalent to the above code involves just one additional character:

    view = response.items?[0].title 

    This is one reason we have found CoffeeScript to be incredibly powerful, enabling us to be twice as productive.

    SASS has also allowed us to code efficiently and quickly. We are making great use of SASS’s nested selectors as well as mixins. Those two features alone have made coding CSS a joy again and not a pain. Through mixins we can keep our CSS more DRY-compliant, and with variables it is a breeze to update color configurations.

    In addition to SASS we are using Compass. We have found Compass to be a great utility belt of common SASS mixins we would have otherwise made ourselves. Rather than having to manually type every vendor prefix we leverage a Compass mixin to automate the process. For example:

    @include border-radius(4px) 

    Compiles to

    -webkit-border-radius: 4px;
    -moz-border-radius: 4px;
    -o-border-radius: 4px;
    border-radius: 4px;

    We get to save time and sanity. It’s a wonderful addition to our toolkit.

    Libraries

    At GetGlue we also take advantage of a number of JavaScript libraries.

    It almost goes without saying that we use jQuery for a whole slew of things: DOM selection, manipulation, and everything in between.

    More exciting however is our great use of Underscore.js and Backbone.js. Our entire front-end UI has been built with Backbone Views, Models, Collections, and Routers.

    Each page of GetGlue is composed of one general PageView, which in turn contains multiple sub-views (and usually those sub-views have sub-views: turtles all the way down).

    All of our data is represented in Backbone Models and Collections with our views bound to change events to re-render themselves.

    And naturally we employ Underscore.js’s functions to make common tasks more enjoyable to perform. From _.map to _.throttle, it makes our coding lives an absolute joy.

    Mustache

    Currently GetGlue is rendered entirely client-side and because of that we make extensive use of Mustache templates.

    We’ve found the (mostly) logic-less Mustache templates easy to work with and they are a cornerstone of our new website. Without any templates we’d have nothing to render and there’d be no website for you to see.

    By keeping most of the view’s logic out of the template we’re able to focus on just the structure of the view and not have to worry about anything else. This allows for fearless template refactors, as nothing else is affected.

    RequireJS

    When developing a large JavaScript application it can become hard to keep track of and manage all the moving pieces that are required to make the application run. To that end we have turned to RequireJS as our module loader and build tool.

    Each discrete section of code lies in its own JavaScript file, attached to its own namespace, and only included by another file when explicitly requested. By adhering to these strict rules we avoid unexpected behavior and can code with confidence.

    Having code for one specific functionality in its own JavaScript file makes for great organizational clarity and lends a certain amount of intuition to our code base: by that I mean when file B is a subclass of file A the file structure mimics that behavior.

    For example, on GetGlue when you open an item from the Guide you open what we internally refer to as a Card. Each type of popover is a derivative of a Card. In our file tree we have the card.coffee base class, and in the directory cards/ lie all of the card subclasses.

    This intuitive structure not only makes development a delight but also allows for easy onboarding of new hires, as the system is easy to understand.

    We also use RequireJS as our primary build tool for our JavaScript and SASS.

    The way you include JavaScript files into other modules is through RequireJS’ require() function. As part of its build process RequireJS traces all require() calls and inlines them into one JavaScript file that we then serve to the browser. (Note: RequireJS allows for other build configurations).

    RequireJS will also take all CSS import statements and inline them into a single CSS file.

    By the end of the build process we’ve slimmed our JS and CSS down to one file apiece, making for a quick download by the client and a speedy experience.

    Conclusion

    We’re very proud of what we’ve accomplished with the new GetGlue website.

    By switching to CoffeeScript and SASS we were able to develop quickly and efficiently. We were able to create features and tweak them without much effort. It made the entire development experience much more enjoyable.

    Giving our site some Backbone.js made UI transition and state management easier than ever before. I will never again dip my hand into the DOM to find state. Thar be dragons.

    RequireJS is a beautiful tool that I wish I had known about years ago. Its module management is intuitive and its build tool is very powerful.

    We’ve had a great time building the new GetGlue website. We hope you enjoy it as much as we do.

    • 6 months ago
  • Analyzing Social Television: Check-ins vs Nielsen Rating

    Introduction

    A couple weeks ago we published our first data science blog post where we investigated the connection between movie check-ins on GetGlue and box office draw. In this post we’ll look at how GetGlue check-ins align with a traditional metric of TV performance: the Nielsen ratings.

    Check-ins vs Nielsen Rating: Single Show

    For this study we looked at check-ins for episodes since the beginning of the year that we also had Nielsen ratings for. Our dataset consisted of 367,369 check-ins, 1,649 episodes, and 237 shows. We removed episodes for shows with stickers to account for promotional bias and normalized the check-in counts to account for growth in the GetGlue user base. By doing this we were better able to compare episodes with check-ins in May to episodes with check-ins in January, for example.

    We’ll start by looking at a single show: Big Bang Theory, a popular comedy on CBS. We chose this show because we had many Nielsen ratings for it and it’s easy to locate in the final graph for this section. Other shows that we plotted using this method correlate equally well.

    Big Bang Theory

    The R² for the trend line is 0.88. This indicates a very good fit for our limited sample size and tells us that there is indeed a strong relationship between check-ins and Nielsen rating, at least in this case. The mean R² was 0.69 with a standard deviation of 0.26 for shows with a relatively significant amount of data (more than 10 episodes and more than 500 check-ins on average).

    Check-ins vs Nielsen Rating: Broadcast Comedies

    Now that we’ve seen that check-ins and Nielsen ratings correlate for a single show, let’s look at how all broadcast comedies correlate.

    [Chart: getglue_comedies]

    The R² for the trend line is 0.55. Not as good as a single show, but still a fairly strong correlation. Assuming the Nielsen ratings are accurate, this tells us that the number of check-ins a show receives relative to its audience size is made up of many different factors. Even shows within the same genre may have different check-in patterns.

    Check-ins vs Nielsen Rating: All Genres

    Next we’ll look at all episodes across different genres and networks. Our hunch was that we would see more variation in the data, but that individual genres would form distinct curves. Indeed, this is what we found. This chart shows the number of check-ins versus the Nielsen rating for that episode. The color of each point on the graph represents the genre of the show for that episode and the shape represents whether the show was on broadcast or cable television.

    [Chart: getglue_genres]

    There is a lot to glean from this chart, especially with regard to the way the data seems to form clear clusters of TV shows.

    We noticed several distinct groups that formed when we made the plot:

    • Supernatural teen dramas include Smallville, Vampire Diaries, and Supernatural, all of which are on broadcast, plus Being Human, which is on cable (SyFy). Tosh.0 and South Park, two cable comedy shows, appear in this cluster as well.
    • Teen dramas include Gossip Girl, Greek, Skins, 90210, and One Tree Hill.
    • Family sitcoms include How I Met Your Mother, Modern Family, Big Bang Theory, and Mr. Sunshine.
    • Crime dramas include NCIS, NCIS: Los Angeles, Criminal Minds, Criminal Minds: Suspect Behavior, The Mentalist, Castle, CSI, Body of Proof, Blue Bloods, and Hawaii Five-0.
    • Music/Dance reality shows include American Idol and Dancing with the Stars.

    It is interesting to see the consumption habits of users on GetGlue. For example, teen dramas on GetGlue drive the same amount of engagement as crime dramas, even though the estimated audience size is much lower for teen dramas. One possibility is that GetGlue users are young and tech-savvy and not likely to fit into the crime drama demographic. It may also be that young people are more likely to watch on DVR or the Internet, and the Nielsen ratings fail to capture that portion of the audience. Another possibility is that something inherent about teen dramas causes people to check in more, possibly because there is more to talk about.

    The difference between broadcast and cable TV consumption is interesting as well. Most of the low rated/low check-in shows tend to be female-oriented cable reality shows such as Millionaire Matchmaker, Kate Plus 8, Real Housewives, etc. The highly engaged shows on cable tend to be dramas and comedies.

    Check-ins vs Nielsen Rating: Men vs Women

    Now that we’ve looked at genre, what happens if we break down the data by another factor, say gender? The next chart is the same as the last one, but instead of genres we colored the episodes by whether the show was mainly watched by men, mainly watched by women, or was watched roughly equally by both men and women.

    [Chart: getglue_gender]

    The first thing to notice is that there is a lot of pink. Although the GetGlue active user base is about 50/50 male/female, 62% of the check-ins in our data set came from women. One of the reasons for the disparity may be that we did not include sporting events in the list of shows. Another reason may be that men are embarrassed to check in to shows that are thought of as feminine and are therefore underrepresented in the sample. Women, on the other hand, do not feel the same way about shows that are considered masculine. Lastly, women may simply watch more TV and/or enjoy checking in more. In addition to the makeup of users, the chart tells us that men tend to favor comedies and sci-fi shows while women tend to favor dramas and reality shows.

    We also thought it would be fun to list the top shows by male to female ratio. We only looked at shows that averaged more than 500 check-ins an episode to weed out some of the more obscure shows.

    Top shows for women

    1. The Bachelorette
    2. Grey’s Anatomy
    3. Dancing with the Stars
    4. The Bachelor
    5. Vampire Diaries
    6. Pretty Little Liars
    7. Inside The Royal Wedding
    8. Real Housewives of Beverly Hills/Atlanta
    9. Secret Life of the American Teenager
    10. Gossip Girl

    Top shows for men

    1. Stargate Universe
    2. The Cape
    3. Archer
    4. V
    5. Lights Out
    6. Smallville
    7. South Park
    8. Human Target
    9. Outsourced
    10. Perfect Couples

    Yes, men like watching Perfect Couples apparently: they made up a whopping 60% of the check-ins. For the top women’s show, The Bachelorette, check-ins from women outnumbered those from men almost 5 to 1. For the top men’s show, Stargate Universe, check-ins from men outnumbered those from women almost 3 to 1.

    Conclusion

    Solely looking at the Nielsen rating of an episode won’t tell you how many check-ins it received. However, when looking at more variables such as the genre of the show and the number of male and female viewers we can start to build an accurate prediction model.

    We are excited about using GetGlue data to provide insights into social entertainment. This is only the first of many blog posts that will involve TV. As we gain more users and more data our insights will only continue to get better. Let us know your thoughts and if there is anything in particular that you would like to see us analyze in the future. Stay tuned for more!

    • 1 year ago
  • Analyzing Movies at GetGlue

    We are excited to announce GetGlue’s first blog post solely focused on analyzing all of the data we’ve collected so far. We hope to be the go-to source for cool and interesting insights into social entertainment (think OkCupid-style charts). Our first post will be focused on last week’s movie data.

    We’ve decided to start things off with rankings of the top movies, which we’ll continue to do every week.

    [Chart: Top Movies]

    Unsurprisingly, The Hangover: Part II, which opened last weekend, takes the top spot in our first top movies chart. We had been tracking the enormous number of GetGlue user interactions with the movie before its release and could tell that it was going to fare well at the box office.

    Now let’s take a look at how well GetGlue check-ins corresponded with box office revenue.

    [Chart: Weekend Check-ins vs Weekend Box Office Gross]

    As you can see, there is a clear correlation between check-ins and box office dollars. The gray dotted line represents the average relationship between the two. The biggest deviation from the trend comes from Something Borrowed, which was expected to have a much higher weekend gross. One possible reason for variance in the data is the availability and promotion of stickers, which we realize can influence the number of check-ins. For the mathematically inclined, to get the trend line we performed a simple linear regression and obtained an R² value of 0.95. In other words, 95% of the variance in the data was explained by the trend line. A perfect fit would have an R² value of 1.0.
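
    For readers who want to see how such an R² value is computed, here is a minimal sketch of a simple linear regression in plain Python. The function name and the sample numbers are made up for illustration; they are not GetGlue data.

```python
def r_squared(xs, ys):
    """R² of the least-squares line fitted through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variance left unexplained by the line vs. total variance in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up points: a perfect line, then a noisier set.
print(round(r_squared([1, 2, 3, 4], [2, 4, 6, 8]), 2))  # 1.0
print(round(r_squared([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.64
```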

    So now that we know that the number of check-ins to a movie pretty much corresponds to the revenue it’ll receive in theaters, can we predict how well a movie is going to do? In this next chart, we plotted the number of pre-release interactions (combination of visits, check-ins, likes, etc.) of movies against their opening box office revenues.

    [Chart: Pre-release Interactions vs Opening Gross]

    This chart shows that GetGlue is a fairly good predictor of how well a movie is going to do. Again, we included the trend line. This time we obtained an R² value of 0.85. The biggest deviation from the trend came from Fast Five, which based on the pre-release interactions was expected to gross much less in its opening weekend than it actually did. Another thing to note is that we did not normalize the data for user growth or feature changes to the site, which also would have affected the number of interactions.

    The last chart shows the number of movie check-ins by weekday.

    [Chart: Movie Check-ins By Weekday]

    As expected, the weekend grabs the lion’s share of check-ins, but surprisingly Sunday is the biggest movie day, not Friday or Saturday. One thing we didn’t do, however, is show where the check-ins are coming from. We expect most of the check-ins to come from in-theater viewings, but there are also check-ins coming from DVD and internet viewings.

    Stay tuned for our next blog post, where we’ll delve into TV.

    • 1 year ago
© 2012–2013 GetGlue Engineering