All the open source code in GitHub now shared within BigQuery: Analyze all the code!

Felipe Hoffa
Jun 29, 2016

All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all — but it’s not all. I’m updating the answers to these and other questions at github.com/fhoffa/analyzing_github.

What the pipeline mirrors:

  • Only projects that have a clear open source license.
  • Forks and un-notable projects are not included.
  • Nevertheless, it represents terabytes of code.

Official sources:

In-depth analysis

I’m waiting for your contributions — I will add them here:

A series of posts by Robert Kozikowski:

Tips

  • Don’t analyze the main [bigquery-public-data:github_repos.contents] table — at 1.5 TB, it will instantly consume your monthly free terabyte. Use instead the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
  • How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
  • I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
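To make the first tip concrete, here is a small back-of-the-envelope sketch in Python. It uses only the sizes quoted above (1.5 TB and ~23 GB) and assumes decimal units (1 TB = 1000 GB). The function names are made up for illustration. It also assumes a query reads the whole table. BigQuery bills by the columns a query actually touches, so treat these numbers as a rough upper bound, not a precise bill.

```python
# Sizes quoted in this post, in gigabytes.
# Assumption: decimal units, 1 TB = 1000 GB.
FREE_TIER_GB = 1000.0  # the "monthly free terabyte"

TABLE_SIZES_GB = {
    "bigquery-public-data:github_repos.contents": 1500.0,       # "1.5 TB"
    "bigquery-public-data:github_repos.sample_contents": 23.0,  # "~23 GB"
}


def free_tier_fraction(table, free_tier_gb=FREE_TIER_GB):
    """Fraction of the monthly free quota used by one full scan of `table`."""
    return TABLE_SIZES_GB[table] / free_tier_gb


def full_scans_per_month(table, free_tier_gb=FREE_TIER_GB):
    """Whole number of full scans of `table` that fit in the free quota."""
    return int(free_tier_gb // TABLE_SIZES_GB[table])
```

Under these assumptions, one full scan of the main contents table uses 1.5 times the free quota, so not even a single scan fits. The sample extract leaves room for about 43 full scans per month.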

Visualizations
