Querying AI and Cloud Trends: Azure and OpenAI Growth Slows, Amazon Growth Peaked in June
Last Updated on September 2, 2024 by Editorial Team
Author(s): Jonathan Bennion
Originally published on Towards AI.
Cutting through the AI hype to query actual developer usage (as new repos, so with presumptions) for prioritization of safety tools and partnerships.
TLDR (with caveats noted below):
- Public AI repos now appear as linear growth, not exponential (surge in March 2024 followed by rapid decline, now slower but steady).
- Azure/OpenAI public repo dominance: Azure shows 20x more new repos each month than the next leading hyperscaler, with OpenAI usage also dominating.
- Amazon Bedrock public repo growth may have peaked in June 2024 (slightly exponential until then).
Introduction: what did I query?
I used GitHub repository creation data to analyze adoption trends in AI and cloud computing. Code below, analysis follows.
Note on caveats:
Despite obvious bias and limitations (the query only sees public repos whose names contain these public package names), this method offers a unique view into developer adoption. Google Cloud and/or Microsoft formerly enabled querying of code within pages, which would have allowed a count of distinct import statements, but this was disabled at some point recently, leaving only repo names queryable.
While imperfect, looking at repo creation provides enough data to challenge prevailing market narratives.
First, the notebook setup:
It's only possible to use Google Cloud Platform (GCP) and BigQuery to access and query the GitHub data archive, so I installed these packages (I used Colab initially; the notebook is now parked on GitHub).
# Install packages
!pip install -q pandas seaborn matplotlib google-cloud-bigquery
# Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter  # used below to format the x-axis dates
from google.cloud import bigquery
from google.oauth2 import service_account
Query from GCP out of BigQuery:
The following SQL extracts relevant data by categorizing repositories related to specific AI and cloud technologies, then aggregates repository creation counts by creation month.
This depends on some manual investigation of the right Python package names.
query = """
WITH ai_repos AS (
  SELECT
    repo.name AS repo_name,
    EXTRACT(DATE FROM created_at) AS creation_date,
    CASE
      WHEN LOWER(repo.name) LIKE '%bedrock%' THEN 'bedrock'
      WHEN LOWER(repo.name) LIKE '%vertex%' THEN 'vertex'
      WHEN LOWER(repo.name) LIKE '%openai%' THEN 'openai'
      WHEN LOWER(repo.name) LIKE '%anthropic%' THEN 'anthropic'
      WHEN LOWER(repo.name) LIKE '%langchain%' THEN 'langchain'
      WHEN LOWER(repo.name) LIKE '%azure%' THEN 'azure'
      WHEN LOWER(repo.name) LIKE '%llamaindex%' THEN 'llamaindex'
      WHEN LOWER(repo.name) LIKE '%neo4j%' THEN 'neo4j'
      WHEN LOWER(repo.name) LIKE '%pymongo%' THEN 'pymongo'
      WHEN LOWER(repo.name) LIKE '%elasticsearch%' THEN 'elasticsearch'
      WHEN LOWER(repo.name) LIKE '%boto3%' THEN 'boto3'
      WHEN LOWER(repo.name) LIKE '%ayx%' THEN 'ayx'
      WHEN LOWER(repo.name) LIKE '%snowflake-connector-python%' THEN 'snowflake'
      WHEN LOWER(repo.name) LIKE '%c3-toolset%' THEN 'c3ai'
      WHEN LOWER(repo.name) LIKE '%dataiku-api-client%' THEN 'dataiku'
      WHEN LOWER(repo.name) LIKE '%salesforce-einstein-vision-python%' THEN 'salesforce_einstein'
      WHEN LOWER(repo.name) LIKE '%qlik-py-tools%' THEN 'qlik'
      WHEN LOWER(repo.name) LIKE '%palantir-foundry-client%' THEN 'palantir_foundry'
      WHEN LOWER(repo.name) LIKE '%cuda-python%' THEN 'nvidia_cuda'
      WHEN LOWER(repo.name) LIKE '%openvino%' THEN 'intel_openvino'
      WHEN LOWER(repo.name) LIKE '%clarifai%' THEN 'clarifai'
      WHEN LOWER(repo.name) LIKE '%twilio%' THEN 'twilio'
      WHEN LOWER(repo.name) LIKE '%oracleai%' THEN 'oracle_ai'
      ELSE 'other'
    END AS keyword_category
  FROM
    `githubarchive.day.20*`
  WHERE
    _TABLE_SUFFIX >= '240101'
    AND _TABLE_SUFFIX NOT LIKE '%view%'
    AND type = 'CreateEvent'
    AND repo.name IS NOT NULL
    AND (
      LOWER(repo.name) LIKE '%bedrock%'
      OR LOWER(repo.name) LIKE '%vertex%'
      OR LOWER(repo.name) LIKE '%openai%'
      OR LOWER(repo.name) LIKE '%anthropic%'
      OR LOWER(repo.name) LIKE '%langchain%'
      OR LOWER(repo.name) LIKE '%azure%'
      OR LOWER(repo.name) LIKE '%llamaindex%'
      OR LOWER(repo.name) LIKE '%neo4j%'
      OR LOWER(repo.name) LIKE '%pymongo%'
      OR LOWER(repo.name) LIKE '%elasticsearch%'
      OR LOWER(repo.name) LIKE '%boto3%'
      OR LOWER(repo.name) LIKE '%ayx%'
      OR LOWER(repo.name) LIKE '%snowflake-connector-python%'
      OR LOWER(repo.name) LIKE '%c3-toolset%'
      OR LOWER(repo.name) LIKE '%dataiku-api-client%'
      OR LOWER(repo.name) LIKE '%salesforce-einstein-vision-python%'
      OR LOWER(repo.name) LIKE '%qlik-py-tools%'
      OR LOWER(repo.name) LIKE '%palantir-foundry-client%'
      OR LOWER(repo.name) LIKE '%cuda-python%'
      OR LOWER(repo.name) LIKE '%openvino%'
      OR LOWER(repo.name) LIKE '%clarifai%'
      OR LOWER(repo.name) LIKE '%twilio%'
      OR LOWER(repo.name) LIKE '%oracleai%'
    )
)
SELECT
  FORMAT_DATE('%Y-%m', creation_date) AS month,
  keyword_category,
  COUNT(DISTINCT repo_name) AS new_repo_count
FROM
  ai_repos
GROUP BY
  month, keyword_category
ORDER BY
  month, keyword_category
"""
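One detail of the query worth keeping in mind: a SQL CASE returns the first branch that matches, so a repo whose name contains several keywords is counted only under the earliest one in the list. Below is a minimal Python sketch of the same rule. The `categorize` helper and the sample repo names are hypothetical; only the keyword order is taken from the first branches of the query.

```python
# Keyword order copied from the first CASE branches of the query above.
KEYWORDS = ["bedrock", "vertex", "openai", "anthropic", "langchain", "azure"]


def categorize(repo_name: str) -> str:
    """Return the first keyword contained in the repo name, like SQL CASE."""
    name = repo_name.lower()
    for keyword in KEYWORDS:
        if keyword in name:
            return keyword
    return "other"


# Hypothetical repo names, for illustration only.
samples = [
    "someuser/azure-openai-chatbot",   # contains both 'openai' and 'azure'
    "someuser/vertex-shader-demo",     # graphics code, not Vertex AI
    "someuser/langchain-bedrock-rag",  # contains both 'bedrock' and 'langchain'
]
for repo in samples:
    print(repo, "->", categorize(repo))
```

Here a name containing both `openai` and `azure` lands in `openai`, and a graphics repo with `vertex` in its name is still counted as `vertex`, which illustrates the naming bias noted in the caveats.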
Then extract, load, transform, etc.
I just created a pivot table in the right format.
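# Create the BigQuery client. The original client setup is not shown; this assumes
# default credentials are configured. With a service-account key file, pass
# credentials=service_account.Credentials.from_service_account_file(<path to key>) instead.
client = bigquery.Client()
# Query output to DF, create pivot
df = client.query(query).to_dataframe()
df['month'] = pd.to_datetime(df['month'])
df_pivot = df.pivot(index='month', columns='keyword_category', values='new_repo_count')
df_pivot.sort_index(inplace=True)
# Drop the last (current, incomplete) month so a partial count does not distort the monthly trend
df_pivot = df_pivot.iloc[:-1]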
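To make the reshaping concrete, here is a small sketch on made-up counts (the numbers and the `toy` frame are hypothetical, not query output). It shows the long-to-wide pivot, and that a keyword with no new repos in a given month ends up as NaN, which one may want to fill with 0 before plotting.

```python
import pandas as pd

# Made-up long-format rows in the same shape as the query result.
toy = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-03"],
    "keyword_category": ["azure", "bedrock", "azure", "bedrock", "azure"],
    "new_repo_count": [100, 5, 120, 7, 40],
})
toy["month"] = pd.to_datetime(toy["month"])

# One row per month, one column per keyword.
toy_pivot = toy.pivot(index="month", columns="keyword_category", values="new_repo_count")
toy_pivot = toy_pivot.sort_index()

# 'bedrock' has no row for 2024-03, so that cell is NaN; fill it with 0.
toy_filled = toy_pivot.fillna(0)

# Drop the last month, treating it as incomplete.
toy_complete = toy_filled.iloc[:-1]
print(toy_complete)
```

Whether to fill gaps with 0 is a judgment call: a missing row means no matching repo names that month, which is a real zero for this query but still subject to the naming caveats.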
Next, plotted the data:
The first time I tried this, I had to move Azure to a secondary axis, since its counts were about 20x those of the next keyword.
# Define color palette
colors = sns.color_palette("husl", n_colors=len(df_pivot.columns))
# Create plot
fig, ax1 = plt.subplots(figsize=(16, 10))
ax2 = ax1.twinx()
lines1 = []
labels1 = []
lines2 = []
labels2 = []
# Plot each keyword as a line, excluding 'azure' for separate axis
for keyword, color in zip([col for col in df_pivot.columns if col != 'azure'], colors):
    line, = ax1.plot(df_pivot.index, df_pivot[keyword], linewidth=2.5, color=color, label=keyword)
    lines1.append(line)
    labels1.append(keyword)
# Plot 'azure' on the secondary axis
if 'azure' in df_pivot.columns:
    line, = ax2.plot(df_pivot.index, df_pivot['azure'], linewidth=2.5, color='red', label='azure')
    lines2.append(line)
    labels2.append('azure')
# Customize the plot
ax1.set_title("GitHub Repository Creation Trends by AI Keyword", fontsize=24, fontweight='bold', pad=20)
ax1.set_xlabel("Repo Creation Month", fontsize=18, labelpad=15)
ax1.set_ylabel("New Repository Count (Non-Azure)", fontsize=18, labelpad=15)
ax2.set_ylabel("New Repository Count (Azure)", fontsize=18, labelpad=15)
# Format x-axis to show dates nicely
ax1.xaxis.set_major_formatter(DateFormatter("%Y-%m"))
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')
# Adjust tick label font sizes
ax1.tick_params(axis='both', which='major', labelsize=14)
ax2.tick_params(axis='both', which='major', labelsize=14)
# Adjust layout
plt.tight_layout()
# Create a single legend for both axes
fig.legend(lines1 + lines2, labels1 + labels2, loc='center left', bbox_to_anchor=(1.05, 0.5), fontsize=12)
# Adjust subplot parameters to give specified padding
plt.subplots_adjust(right=0.85)
plt.show()
The results were interesting. Since each month shows new repos created, Azure growth looked exponential until March 2024, then declined quickly, and has been linear since May 2024.
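One rough way to back a description like "exponential" or "linear" with numbers is to compare month-over-month differences against month-over-month ratios: roughly constant differences suggest linear growth, while roughly constant ratios suggest exponential growth. The sketch below uses made-up series and a hypothetical `growth_shape` helper with an arbitrary tolerance; it is not the method used to read the charts in this article.

```python
from statistics import mean, pstdev


def growth_shape(counts, tolerance=0.1):
    """Label a series 'linear', 'exponential', or 'unclear'.

    Heuristic: compare the relative spread (std / mean) of the
    month-over-month differences and ratios against a tolerance.
    Assumes all counts are positive (a zero would break the ratios).
    """
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    ratios = [b / a for a, b in zip(counts, counts[1:])]

    def relative_spread(values):
        m = mean(values)
        return float("inf") if m == 0 else pstdev(values) / abs(m)

    diff_spread = relative_spread(diffs)
    ratio_spread = relative_spread(ratios)
    if diff_spread <= tolerance and diff_spread <= ratio_spread:
        return "linear"
    if ratio_spread <= tolerance:
        return "exponential"
    return "unclear"


# Made-up monthly counts, for illustration only.
print(growth_shape([100, 110, 120, 130, 140]))   # steady +10 per month
print(growth_shape([100, 200, 400, 800, 1600]))  # doubling each month
```

The first series is labeled linear and the second exponential. Real monthly counts are noisier, so the tolerance would need tuning, and a series with a surge followed by a decline would likely come back as unclear.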
Re-plotted the data for clarity on smaller movements:
With the top 3 keywords removed, it's easier to see the scale. Amazon Bedrock clearly shows steadier adoption but appears to peak in June 2024. Note that some keywords are not a reliable proxy for adoption (e.g. Snowflake, Nvidia CUDA), since the query only matches public package names in the names of public repos.
# Isolate the top 3 to remove
top_3 = df_pivot.mean().nlargest(3).index
df_pivot_filtered = df_pivot.drop(columns=top_3)
fig, ax = plt.subplots(figsize=(16, 10))
for keyword, color in zip(df_pivot_filtered.columns, colors[:len(df_pivot_filtered.columns)]):
    ax.plot(df_pivot_filtered.index, df_pivot_filtered[keyword], linewidth=2.5, color=color, label=keyword)
ax.set_title("GitHub Repository Creation Trends by AI Keyword (Excluding Top 3 Packages)", fontsize=24, fontweight='bold', pad=20)
ax.set_xlabel("Repo Creation Month", fontsize=18, labelpad=15)
ax.set_ylabel("New Repository Count", fontsize=18, labelpad=15)
ax.xaxis.set_major_formatter(DateFormatter("%Y-%m"))
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
ax.tick_params(axis='both', which='major', labelsize=14)
# Adjust layout
plt.tight_layout()
# Place legend outside the plot
ax.legend(loc='center left', bbox_to_anchor=(1.05, 0.5), fontsize=12)
# Adjust subplot parameters to give specified padding
plt.subplots_adjust(right=0.85)
plt.show()
Takeaways:
- Very large disparity between the smaller packages and those from "Big Tech".
- Azure and OpenAI dominate, but growth has slowed.
- Amazon may have peaked in June 2024.
More to come, stay tuned for more parts of this analysis (follow me for more updates).
FYI, the dataframe is below, showing where obvious package names might not reflect the entire usage of the tool (e.g. Nvidia, Snowflake). Note (again) the many biases and caveats (one repo might contain x scripts, etc.), so this assumes that a new (and public) repo represents growth.