Reddit is a gold mine for every programmer. Submissions will allow you to keep
up to date with the latest news and developments, and comments will let you learn and improve.
I have wanted to explore the relationships between programming language subreddits, and their
behaviors, for a long time. Recently I stumbled on a reddit dataset whose maintainer has
archived all of reddit (and still does). I downloaded more than 100 GB of data (the last 3 months), and started my analysis.
This post is focused only on the results, and not the code (maybe for another post?).
Languages are ordered by subscriber count. I may have forgotten your favorite language (please contact me, and I'll add it right away).
Relative Word Frequency
This visualization shows you the 100 most frequent words (fewer on your phone) in each subreddit, relative to the global mean. For instance, the word "learnpython" has a frequency of 0.2% in the Python subreddit, while its mean frequency over all programming language subreddits is 0.01%, which means that its relative frequency is 0.19%. The area of each bubble is proportional to this relative frequency, and the color gets darker as the relative frequency gets higher.
- Python: This subreddit is quite educational! Words like "help", "learn", "questions" and "encourage" are way more frequent here.
- Java & C#: These languages have a reputation for enterprise use, and the data confirms it: words like "company", "business", "developer" and "customer" are indeed quite prevalent.
- Go: Can't escape the biggest controversy of Go: "generic(s)"
- PHP: Comments contain way more "shit" than in other languages. I'll let you draw conclusions as to why...
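The relative frequency described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the chart: the function names are made up for this example, words are split on whitespace after lowercasing, and the "global mean" is assumed to be the unweighted mean of the per-subreddit frequencies.

```python
from collections import Counter


def word_frequencies(comments):
    """Return each word's share of all words in a list of comment strings."""
    # Assumption: naive tokenization (lowercase, split on whitespace).
    counts = Counter(word for text in comments for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def relative_frequencies(comments_by_subreddit):
    """Subtract the mean frequency across subreddits from each subreddit's frequency."""
    freqs = {sub: word_frequencies(c) for sub, c in comments_by_subreddit.items()}
    vocabulary = set().union(*freqs.values())
    n_subs = len(freqs)
    # Assumption: a word missing from a subreddit counts as frequency 0 there.
    mean = {w: sum(f.get(w, 0.0) for f in freqs.values()) / n_subs for w in vocabulary}
    return {sub: {w: f[w] - mean[w] for w in f} for sub, f in freqs.items()}


def top_words(relative, n=100):
    """Return the n words with the highest relative frequency, highest first."""
    return sorted(relative, key=relative.get, reverse=True)[:n]
```

With a toy input such as `{"python": ["learn python", "help learn"], "go": ["generics generics", "learn go"]}`, "generics" comes out on top for Go, because it is frequent there and absent everywhere else.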
Commenters in Common
This matrix shows you the commenters that programming subreddits have in common. Each matrix element shows you the percentage of commenters in the row language who also comment in the column language. This visualization lets us see the relationships between programming languages. We can make some simple observations:
- Python: The biggest language subreddit (by subscriber count) attracts commenters from almost every other subreddit (except Objective-C!).
- Swift: This language is pretty self-centered. R too, but some of its users also comment in the Python subreddit.
- Rust: A medium-sized subreddit (by subscriber count), but surprisingly, it attracts commenters from a lot of subreddits.
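For the curious, here is a minimal sketch of how such a matrix could be built, assuming you already have one set of commenter names per subreddit. The function name and the input shape are hypothetical, chosen just for this example.

```python
def commenter_overlap(commenters_by_subreddit):
    """Percentage of each row subreddit's commenters who also comment in the column subreddit."""
    matrix = {}
    for row, row_users in commenters_by_subreddit.items():
        matrix[row] = {}
        for col, col_users in commenters_by_subreddit.items():
            shared = len(row_users & col_users)
            # Divide by the ROW subreddit's size, so the matrix is not symmetric.
            matrix[row][col] = 100.0 * shared / len(row_users) if row_users else 0.0
    return matrix
```

Because each cell is divided by the size of the row subreddit, the matrix is not symmetric: a small subreddit can send most of its commenters to a big one while barely registering in the other direction.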
Let's display the average comment length by subreddit. I don't know about you, but I expected the functional languages at the top of the rankings. And they are, with Scala and OCaml. I also expected the length of comments to be somewhat inversely related to subscriber count, as larger communities tend to have more "shallow" comments. You can somewhat see that effect on the biggest subreddits.
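Since the data sits in SQLite, a ranking like this one boils down to a single aggregate query. The sketch below is an assumption-laden illustration: the `comments` table and its `subreddit` and `body` columns are hypothetical names, and length is measured in characters, which may not be how the chart was produced.

```python
import sqlite3


def average_comment_length(conn):
    """Return (subreddit, average comment length in characters), longest first."""
    query = (
        "SELECT subreddit, AVG(LENGTH(body)) AS avg_len "
        "FROM comments GROUP BY subreddit ORDER BY avg_len DESC"
    )
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a tiny in-memory database with the hypothetical schema, for trying things out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (subreddit TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?)",
        [
            ("scala", "abcdefghij"),
            ("scala", "abcdef"),
            ("python", "abcd"),
            ("python", "ab"),
        ],
    )
    return conn
```

Calling `average_comment_length(demo_connection())` ranks the toy "scala" rows above the "python" ones, since their bodies are longer on average.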
This last visualization shows you the average score for submissions. With only the last 3 months of data, the sample is probably a bit too small for the smaller subreddits (OCaml had only 83 submissions in 3 months). Nonetheless, we can see that if you need some internet points, you should probably submit in the Rust and Haskell subreddits!
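One way to keep the small-sample caveat visible is to carry the number of submissions along with each average. Here is a rough sketch of that idea, assuming the submissions are available as (subreddit, score) pairs; the function name and the `min_count` cutoff are hypothetical and not something the chart above uses.

```python
from collections import defaultdict


def average_scores(submissions, min_count=1):
    """Map each subreddit to (average score, number of submissions).

    Subreddits with fewer than min_count submissions are left out.
    """
    totals = defaultdict(lambda: [0, 0])  # subreddit -> [score sum, submission count]
    for subreddit, score in submissions:
        totals[subreddit][0] += score
        totals[subreddit][1] += 1
    return {
        subreddit: (total / count, count)
        for subreddit, (total, count) in totals.items()
        if count >= min_count
    }
```

Raising `min_count` hides the subreddits whose average rests on only a handful of submissions, at the cost of dropping them from the ranking entirely.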
That's it for our analysis! It is very simple and basic, and I'm sure there is a lot more to extract and visualize. Programming subreddits, with their comments and submissions, are a great way to understand the dynamics of programming languages. I hope you enjoyed the visualizations. The tools I used: D3.js for the visualizations, SQLite to store the last 3 months of reddit, and Python to do the analysis.