Programming Subreddits

Posted SEPT 17, 2017
Python
Stats
Sql

Introduction

Reddit is a gold mine for every programmer. Submissions keep you up to date with the latest news and developments, and comments let you learn and improve. For a long time I have wanted to explore the relationships between programming language subreddits and how they behave. Recently I stumbled on a reddit dataset maintained by Stuck_In_the_Matrix, who has archived all of reddit (and is still doing so). I downloaded more than 100 GB of data (the last 3 months) and started my analysis. This post focuses only on the results, not the code (maybe for another post?). Let's start!
Languages are ordered by subscriber count. I may have forgotten your favorite language (please contact me, and I'll add it right away).

Relative Word frequency

Python
JavaScript
Java
C++
PHP
C#
Ruby
Go
C
Haskell
Rust
Swift
Scala
Clojure
Perl
Elixir
R
Erlang
Kotlin
Lua
Elm
Objective-C
OCaml

This visualization shows the 100 most frequent words (fewer on your phone) in each subreddit, relative to the global mean. For instance, the word "learnpython" has a frequency of 0.2% in the Python subreddit and a mean frequency of 0.01% across all programming language subreddits, so its relative frequency is 0.19%. The area of each bubble is proportional to this relative frequency, and the color gets darker as the relative frequency gets higher.
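As a rough illustration of the calculation (a minimal sketch, not the code behind this post), the relative frequency can be computed by subtracting the mean per-subreddit frequency from each subreddit's own frequency. The function names, the tiny stop-word list, the tokenizing regex, and the choice of an unweighted mean across subreddits are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "i"}


def word_frequencies(comments, excluded=frozenset()):
    """Return {word: share of all counted words} for one subreddit's comments."""
    counts = Counter()
    for text in comments:
        for word in re.findall(r"[a-z0-9+#']+", text.lower()):
            if word not in STOP_WORDS and word not in excluded:
                counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def relative_frequencies(comments_by_subreddit, excluded_by_subreddit=None):
    """Return {subreddit: {word: frequency minus mean frequency over subreddits}}."""
    excluded_by_subreddit = excluded_by_subreddit or {}
    freqs = {
        sub: word_frequencies(comments, excluded_by_subreddit.get(sub, frozenset()))
        for sub, comments in comments_by_subreddit.items()
    }
    vocabulary = set().union(*freqs.values()) if freqs else set()
    n = len(freqs)
    # Unweighted mean: every subreddit counts equally, whatever its size (assumption).
    mean = {w: sum(f.get(w, 0.0) for f in freqs.values()) / n for w in vocabulary}
    return {sub: {w: f[w] - mean[w] for w in f} for sub, f in freqs.items()}
```

The optional `excluded_by_subreddit` argument is a hypothetical hook for dropping a subreddit's own language name, for example `{"javascript": {"javascript", "js"}}`. Sorting each inner dictionary by value and keeping the top 100 entries would give the words to draw.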

While doing the frequency count, I removed common English stop words and all the words corresponding to the subreddit's language (i.e., for /r/javascript I removed the words "javascript" and "js"), so that they would not pollute the visualization. With this bubble graph, we can already make some observations:

Commenters in Common

This matrix shows the commenters that programming subreddits have in common. Each element is the percentage of commenters in the row language's subreddit who also comment in the column language's subreddit. This visualization lets us see the relationships between programming languages. We can make some simple observations:
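For readers curious how such a matrix can be built, here is a minimal sketch (not the code behind this post). It assumes the commenters of each subreddit have already been collected into sets of author names; the function name and the nested-dictionary output format are choices made for this example.

```python
def commenter_overlap(commenters_by_subreddit):
    """Return matrix[row][col]: percentage of row's commenters who also comment in col.

    commenters_by_subreddit maps a subreddit name to a set of author names.
    """
    matrix = {}
    for row, row_authors in commenters_by_subreddit.items():
        matrix[row] = {}
        for col, col_authors in commenters_by_subreddit.items():
            if not row_authors:
                # No commenters in the row subreddit: report 0 rather than divide by zero.
                matrix[row][col] = 0.0
            else:
                shared = len(row_authors & col_authors)
                matrix[row][col] = 100.0 * shared / len(row_authors)
    return matrix
```

Note that the matrix is not symmetric: the denominator is always the number of commenters in the row subreddit, so a small community can overlap heavily with a large one without the reverse being true.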

Comment Length

Let's display the average comment length by subreddit. I don't know about you, but I expected the functional languages to be at the top of the rankings. And they are, with Scala and OCaml. I also expected the length of comments to decrease somewhat as subscriber count grows, since larger communities tend to have more "shallow" comments. You can somewhat see that effect on the biggest subreddits.
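A ranking like this one maps naturally onto a single SQL aggregate. The sketch below is only an illustration, not the query used for this post: the `comments` table, its `subreddit`, `author`, and `body` columns, and the choice to measure length in characters are all assumptions, and the in-memory database with made-up rows exists only so the example runs on its own.

```python
import sqlite3


def average_comment_length(conn):
    """Return (subreddit, average comment length in characters), longest first."""
    return conn.execute(
        """
        SELECT subreddit, AVG(LENGTH(body)) AS avg_length
        FROM comments
        GROUP BY subreddit
        ORDER BY avg_length DESC
        """
    ).fetchall()


# Hypothetical schema and toy rows, used here only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (subreddit TEXT, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        ("scala", "alice", "0123456789"),
        ("python", "bob", "ab"),
        ("python", "carol", "abcdef"),
    ],
)
rows = average_comment_length(conn)
```

Counting words instead of characters would be an equally reasonable definition of length, and it could change the ordering for subreddits whose comments contain a lot of code.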

Submission scores

This last visualization shows the average score for submissions. With only the last 3 months of data, the sample may be too small for the smaller subreddits (OCaml had only 83 submissions in 3 months). Nonetheless, we can see that if you need some internet points, you should probably submit to the Rust and Haskell subreddits!
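Because small samples make an average unreliable, it can help to carry the submission count along with the mean. The following is a minimal sketch under stated assumptions, not the code behind this post: the input is taken to be an iterable of (subreddit, score) pairs, and the `min_submissions` threshold of 100 is an arbitrary value chosen for illustration.

```python
from collections import defaultdict


def average_scores(submissions, min_submissions=100):
    """Return rows (subreddit, mean score, count, small_sample), highest mean first.

    submissions is an iterable of (subreddit, score) pairs. The small_sample flag
    is True when a subreddit has fewer than min_submissions submissions.
    """
    totals = defaultdict(lambda: [0, 0])  # subreddit -> [sum of scores, count]
    for subreddit, score in submissions:
        totals[subreddit][0] += score
        totals[subreddit][1] += 1
    rows = [
        (sub, total / count, count, count < min_submissions)
        for sub, (total, count) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Rows flagged as small samples could be drawn in a lighter color or marked with a footnote so that readers treat those averages with more caution.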

Conclusion

That's it for our analysis! It is very simple and basic, and I'm sure there is a lot more to extract and visualize. Programming subreddits, with their comments and submissions, are a great way to understand the dynamics of programming languages. I hope you enjoyed the visualizations. The tools I used: D3.js for the visualizations, SQLite to store the last 3 months of reddit, and Python to do the analysis.