Anatomy of OpenAI's Developer Community¶
OpenAI has an official developer community hosted on Discourse, which is the central place for people seeking help and discussing OpenAI's APIs, ChatGPT, prompting and more.
The forum was launched in March 2021 and has since seen 100,000+ posts by over 20,000 users.
Given its size and concentration of topics, the forum is a great resource for understanding the general sentiment of developers, identifying the common problems and rabbit holes users face, and gathering feedback on OpenAI products.
To get deeper insights into developer experience and shared sentiment about certain products, we downloaded all the posts from the most common categories on the forum. The following categories and their relevant sub-categories are included:
- API
- API/Bugs
- API/Deprecations
- API/Feedback
- GPT Builders
- GPT Builders/Chat-Plugins
- GPT Builders/Plugin-Store
- Prompting
- Community
- Documentation
We created a dataset of all posts and discussions in the above categories that took place on the forum up to 28th February 2024.
- 🤗 HuggingFace Link
But... why?¶
We believe there's a lot to learn from what people are struggling with and from how developers feel about their experience with OpenAI's products.
This dataset was made so that we could answer these questions. There's a lot of potential in learning from OpenAI's mistakes and successes.
We at Julep would love to hear what you build with the dataset! Hit us up on X/Twitter or email.
Getting data from Discourse¶
Every Discourse discussion returns its data as JSON if you append .json to the URL.
- Discussion URL: https://community.openai.com/t/{discussion_id}
- Discussion in JSON: https://community.openai.com/t/{discussion_id}.json
- Discussion in Markdown: https://community.openai.com/raw/{discussion_id}
Raw data was gathered into a single JSONL file by automating a browser using Playwright.
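The URL patterns above can be wrapped in a couple of small helpers. This is a minimal sketch using only the standard library, not the Playwright-based scraper that was actually used; the function names and the User-Agent string are hypothetical.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://community.openai.com"


def discussion_urls(discussion_id: int) -> dict[str, str]:
    """Return the HTML, JSON and Markdown URLs for one discussion."""
    return {
        "html": f"{BASE_URL}/t/{discussion_id}",
        "json": f"{BASE_URL}/t/{discussion_id}.json",
        "markdown": f"{BASE_URL}/raw/{discussion_id}",
    }


def fetch_discussion(discussion_id: int, timeout: float = 10.0) -> dict:
    """Download one discussion as parsed JSON (performs a network request)."""
    request = Request(
        discussion_urls(discussion_id)["json"],
        headers={"User-Agent": "community-dataset-sketch"},  # hypothetical value
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(discussion_urls(12345)["json"])
```

A real crawl would also need rate limiting and retries, which are left out here.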
Let's walk through how the dataset was made and then showcase some initial trends we noticed.
Feature Engineering¶
A brief walkthrough of how the features were engineered.
Since each row held one discussion and each discussion held multiple posts in a thread, the dataset needed to be normalised into two levels: the post level, with features of an individual post, and the post_discussion level, with features of the discussion the post belongs to.
For example:
- Post-level features: post_id; post_author
- Discussion-level features: post_discussion_id; post_category_id
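The normalisation step can be sketched as follows. The toy discussion below is made up, and the mapping from Discourse JSON keys (such as `post_stream.posts`, `username` and `cooked`) to the dataset's column names is an assumption for illustration rather than the exact pipeline used.

```python
import pandas as pd

# A trimmed, made-up discussion in the general shape Discourse returns.
discussion = {
    "id": 101,
    "title": "Example discussion",
    "category_id": 7,
    "views": 250,
    "post_stream": {
        "posts": [
            {"id": 1, "username": "alice", "post_number": 1, "cooked": "<p>Question</p>"},
            {"id": 2, "username": "bob", "post_number": 2, "cooked": "<p>Answer</p>"},
        ]
    },
}


def discussion_to_rows(discussion: dict) -> list[dict]:
    """Flatten one discussion into one row per post."""
    # Discussion-level features are repeated on every post row.
    shared = {
        "post_discussion_id": discussion["id"],
        "post_discussion_title": discussion["title"],
        "post_category_id": discussion["category_id"],
        "post_discussion_views": discussion["views"],
    }
    rows = []
    for post in discussion["post_stream"]["posts"]:
        rows.append(
            {
                "post_id": post["id"],
                "post_author": post["username"],
                "post_number": post["post_number"],
                "post_content": post["cooked"],
                **shared,
            }
        )
    return rows


posts_df = pd.DataFrame(discussion_to_rows(discussion))
print(posts_df[["post_id", "post_author", "post_discussion_id"]])
```

One consequence of this layout is that discussion-level values appear once per post, which matters when aggregating them later.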
%matplotlib widget
import pandas as pd
from datasets import Dataset, load_from_disk, load_dataset
hf_dataset = load_from_disk("9_dataset_with_topics")
# hf_dataset = load_dataset("julep-ai/openai-community-posts")
df = hf_dataset.to_pandas()
hf_dataset.features
{'post_discussion_id': Value(dtype='int64', id=None), 'post_discussion_tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'post_discussion_title': Value(dtype='string', id=None), 'post_discussion_created_at': Value(dtype='timestamp[ns, tz=UTC]', id=None), 'post_category_id': Value(dtype='int64', id=None), 'post_discussion_views': Value(dtype='int64', id=None), 'post_discussion_reply_count': Value(dtype='int64', id=None), 'post_discussion_like_count': Value(dtype='int64', id=None), 'post_discussion_participant_count': Value(dtype='int64', id=None), 'post_discussion_word_count': Value(dtype='float64', id=None), 'post_id': Value(dtype='int64', id=None), 'post_author': Value(dtype='string', id=None), 'post_created_at': Value(dtype='string', id=None), 'post_content': Value(dtype='string', id=None), 'post_read_count': Value(dtype='int64', id=None), 'post_reply_count': Value(dtype='int64', id=None), 'post_author_id': Value(dtype='int64', id=None), 'post_number': Value(dtype='int64', id=None), 'post_discussion_related_topics': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'accepted_answer_post': Value(dtype='float64', id=None), 'post_content_raw': Value(dtype='string', id=None), 'post_category_name': Value(dtype='string', id=None), 'post_sentiment': Value(dtype='string', id=None), 'post_sentiment_score': Value(dtype='float64', id=None), 'post_content_cluster_embedding': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'post_content_classification_embedding': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'post_content_search_document_embedding': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'tag1': Value(dtype='string', id=None), 'tag2': Value(dtype='string', id=None), 'tag3': Value(dtype='string', id=None), 'tag4': Value(dtype='string', id=None), 'post_discussion_url': Value(dtype='string', id=None), 'post_url': Value(dtype='string', id=None), 
'topic_model_medium': Value(dtype='string', id=None), 'topic_model_broad': Value(dtype='string', id=None)}
# Total number of posts
print("Total number of posts: ", len(df))
# Total discussions
print("Total discussions: ", len(df["post_discussion_id"].unique()))
# Total number of users
print("Total number of users: ", len(df["post_author_id"].unique()))
Total number of posts: 97033
Total discussions: 18990
Total number of users: 21419
# Earliest and latest post
print("Earliest post: ", df["post_created_at"].min())
print("Latest post: ", df["post_created_at"].max())
Earliest post: 2021-03-10T20:39:25.848Z
Latest post: 2024-02-27T14:03:01.685Z
Apart from post-level and discussion-level features, the following classes of features were computed:
- Sentiment
- Vector Embeddings
- Topic Models
- Twitter-roBERTa-base
Sentiment¶
Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive or neutral) and a post_sentiment_score confidence score for each post.
Overall, most posts (about 60%) are neutral.
df["post_sentiment"].value_counts(ascending=True, normalize=True)
post_sentiment
negative    0.185277
positive    0.219327
neutral     0.595395
Name: proportion, dtype: float64
However, looking at the distribution per category, we see that api/bugs has by far the most negative sentiment (about 38% of posts), followed by api/feedback and gpts-builders (about 26% each).
On the other hand, community and gpts-builders/plugin-store have the most positive sentiment, which tracks, as people often showcase cool projects, news and the latest AI developments in community!
# Group by 'post_category_name' and then apply normalized value_counts to 'post_sentiment'
sentiment_percentages = df.groupby("post_category_name")["post_sentiment"].apply(
lambda x: x.value_counts(normalize=True)
)
# Convert the Series to a DataFrame and reset the index
# sentiment_percentages = sentiment_percentages.mul(
# 100
# ) # Convert fractions to percentages
sentiment_percentages = sentiment_percentages.reset_index(name="percentage")
# Pivot the table for better readability
pivot_df = sentiment_percentages.pivot(
index="post_category_name", columns="level_1", values="percentage"
)
# Fill NaN values with zero if any sentiment labels are missing in a category
pivot_df = pivot_df.fillna(0)
pivot_df.columns.rename(None, inplace=True)
# Display the pivoted DataFrame
pivot_df
post_category_name | negative | neutral | positive |
---|---|---|---|
api | 0.188675 | 0.624195 | 0.187131 |
api/bugs | 0.376378 | 0.533858 | 0.089764 |
api/deprecations | 0.161049 | 0.662921 | 0.176030 |
api/feedback | 0.261770 | 0.553672 | 0.184557 |
community | 0.137866 | 0.502298 | 0.359837 |
documentation | 0.137372 | 0.559727 | 0.302901 |
gpts-builders | 0.260511 | 0.597313 | 0.142176 |
gpts-builders/chat-plugins | 0.232624 | 0.538543 | 0.228833 |
gpts-builders/plugin-store | 0.187500 | 0.506944 | 0.305556 |
prompting | 0.133054 | 0.633530 | 0.233416 |
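The same per-category breakdown can also be produced in one call. The sketch below uses `pd.crosstab` on made-up toy data as an alternative to the groupby-and-pivot code above; it is not the code that generated the table.

```python
import pandas as pd

# Toy data standing in for the real DataFrame.
toy = pd.DataFrame(
    {
        "post_category_name": ["api", "api", "api/bugs", "api/bugs", "api/bugs", "community"],
        "post_sentiment": ["neutral", "negative", "negative", "negative", "neutral", "positive"],
    }
)

# normalize="index" makes every row (category) sum to 1.
shares = pd.crosstab(toy["post_category_name"], toy["post_sentiment"], normalize="index")

# Rank categories from most to least negative.
print(shares.sort_values("negative", ascending=False))
```

Sorting by a sentiment column makes it easy to read off which categories lead in negative or positive share.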
Vector Embeddings¶
For calculating vector embeddings, Nomic Embed-Text v1.5 was run locally with the help of text-embeddings-inference. Because of its resizable Matryoshka nature, it's possible to use these embeddings in a bunch of future applications.
Nomic Embed v1.5 was largely selected due to its large context length.
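To illustrate what "resizable" means in practice, here is a rough sketch of shrinking an embedding by keeping only its leading dimensions and re-normalising. The helper name is hypothetical, the random vectors merely stand in for real embeddings, and the model's documentation may recommend extra normalisation steps that are not reproduced here.

```python
import numpy as np


def resize_embedding(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalise the result."""
    truncated = np.asarray(vectors, dtype=float)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))  # stand-in for four full-size embeddings
small = resize_embedding(full, 256)
print(small.shape)  # (4, 256)
```

Smaller vectors trade some accuracy for cheaper storage and faster similarity search.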
import matplotlib.pyplot as plt
import seaborn as sns
df["post_content_raw_length"] = df["post_content_raw"].apply(len)
plt.figure(figsize=(12, 6))
sns.histplot(
df["post_content_raw_length"], bins=100, kde=False, cumulative=True, stat="density"
)
plt.title("CDF of Length Distribution of post_content_raw")
plt.xlabel("Length of post_content_raw")
plt.ylabel("Cumulative Density")
plt.show()
Looking at the cumulative distribution plot, we see that 99.7% of the posts are shorter than 8192 characters. Since a token is roughly four characters of English text (so the token count is only about a quarter of the character count), we can go ahead and vectorise post_content_raw with truncate=True without worrying about much information being lost.
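The share of posts under a length threshold can also be checked numerically rather than read off the plot. A small sketch on made-up strings; on the real data the Series would be `df["post_content_raw"]`.

```python
import pandas as pd

MAX_CHARS = 8192  # character threshold discussed above

# Made-up posts: three short ones and one that exceeds the threshold.
toy_posts = pd.Series(["short post", "x" * 9000, "another short post", "y" * 100])
lengths = toy_posts.str.len()

# The mean of a boolean Series is the fraction of True values.
share_below = (lengths < MAX_CHARS).mean()
print(f"{share_below:.1%} of posts are shorter than {MAX_CHARS} characters")
```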
Nomic supports embeddings for searching, clustering and classification tasks.
We've computed all three types of embeddings over the post_content_raw field.
df[
[
"post_content_cluster_embedding",
"post_content_classification_embedding",
"post_content_search_document_embedding",
]
]
 | post_content_cluster_embedding | post_content_classification_embedding | post_content_search_document_embedding |
---|---|---|---|
0 | [-0.011406975, 0.051801503, -0.17560473, -0.02... | [-0.029118164, 0.031905293, -0.1484513, -0.015... | [-0.02271616, 0.033384237, -0.15369709, -0.017... |
1 | [0.075202845, 0.039509684, -0.21858266, -0.039... | [0.054288954, 0.008444463, -0.20109606, -0.042... | [0.06603962, 0.0113762, -0.15860154, -0.031844... |
2 | [0.081624806, 0.051376425, -0.21687175, -0.017... | [0.073775776, 0.034063034, -0.19114846, -0.007... | [0.049514644, 0.026369868, -0.17453438, -0.011... |
3 | [0.04684566, 0.07910612, -0.2271005, -0.007859... | [0.013259176, 0.015849816, -0.22634435, -0.022... | [0.012498306, -0.00900329, -0.092770934, 0.007... |
4 | [-0.016075207, 0.10314193, -0.22071771, -0.024... | [-0.034080368, 0.09957978, -0.20546404, -0.018... | [-0.015658986, 0.071472555, -0.19949938, -0.00... |
... | ... | ... | ... |
97028 | [0.04684566, 0.07910612, -0.2271005, -0.007859... | [0.013259176, 0.015849816, -0.22634435, -0.022... | [0.012498306, -0.00900329, -0.092770934, 0.007... |
97029 | [0.032625105, 0.052557576, -0.15643555, -0.055... | [0.019226272, 0.02287624, -0.1287021, -0.05793... | [0.012006246, 0.022498403, -0.10844656, -0.033... |
97030 | [0.01553116, 0.03656999, -0.15440144, -0.06329... | [-0.0005156845, 0.011319388, -0.11510259, -0.0... | [0.0049003367, 0.009971226, -0.12864526, -0.05... |
97031 | [0.03986051, 0.048007715, -0.17821708, -0.0489... | [0.0199232, 0.019354336, -0.14687058, -0.04700... | [0.016168084, 0.03449353, -0.16987395, -0.0337... |
97032 | [0.04684566, 0.07910612, -0.2271005, -0.007859... | [0.013259176, 0.015849816, -0.22634435, -0.022... | [0.012498306, -0.00900329, -0.092770934, 0.007... |
97033 rows × 3 columns
from nomic import atlas, AtlasDataset
from IPython.core.display import HTML
dataset = AtlasDataset(identifier="glitch/openai-community-posts---clustering---v2")
2024-03-20 16:22:42.072 | INFO | nomic.dataset:__init__:779 - Loading existing dataset `glitch/openai-community-posts---clustering---v2``.
HTML(dataset.maps[0]._embed_html())
Atlas is a very cool tool with a great set of filters. Feel free to peruse the dataset above!
As a general rule, the Search, Filter, Lasso and Cherry Pick tools on the left side help with selecting and refining datapoints.
View Settings on the right side has nifty visualisations.
For example: filter for all points where sentiment is negative and set Color By in View Settings to post_discussion_views, scaled logarithmically.
Vector Search¶
It is quite powerful to be able to execute similarity searches based on post IDs.
A Q&A interface built on these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P).
Let's view some posts similar to this one complaining about function calling:
atlas_map = dataset.maps[0]  # avoid shadowing the built-in `map`
# Look up the 7 nearest neighbours of the datapoint with ID "Fjk"
neighbors, distances = atlas_map.embeddings.vector_search(ids=["Fjk"], k=7)
similar_datapoints = dataset.get_data(ids=neighbors[0])
for i, point in enumerate(similar_datapoints):
    if i == 0:
        print("Initial point:", point.get("post_discussion_title"), "\n")
        print("Nearest neighbors:")
    else:
        print(point.get("post_discussion_title"))
Initial point: Gpt-4-1106-preview messes up function call parameters encoding

Nearest neighbors:
Gpt-4-1106-preview messes up function call parameters encoding
When structuring the output of function calls, there is Chinese character encoding issue resulting in garbled text
Confused on models that have function calling and when they get deprecated
Gpt-4-1106-preview is not generating utf-8
Gpt-3.5-turbo-1106 Calls multiple of the same function unecessarily
There is a mistake on the doc page of function calling
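The same kind of lookup can be approximated locally from the embedding columns in the dataset. Below is a minimal brute-force cosine-similarity sketch in NumPy on toy vectors; the function name is hypothetical and this is not how Atlas implements its search.

```python
import numpy as np


def nearest_neighbors(embeddings: np.ndarray, query_index: int, k: int = 3) -> list[int]:
    """Return indices of the k rows most similar to the query row (query included)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = unit @ unit[query_index]
    # argsort is ascending, so negate to get the highest similarity first.
    return np.argsort(-similarities)[:k].tolist()


toy_embeddings = np.array(
    [
        [1.0, 0.0],
        [0.9, 0.1],
        [0.0, 1.0],
        [-1.0, 0.0],
    ]
)
print(nearest_neighbors(toy_embeddings, query_index=0, k=2))  # [0, 1]
```

On the real data, the matrix could be built with something like `np.vstack(df["post_content_search_document_embedding"])`; brute force is fine at this scale, though an approximate index would be faster for larger collections.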
Preliminary Data Analysis¶
On completing feature engineering, we are left with 36 features in total that we can explore.
Here, we give some basic information about the dataset and its features, which one could use as a starting point.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
print("Total features:", len(df.columns))
Total features: 36
og_df = df.copy(deep=True)
Correlation Heatmap
As expected, replies, likes and word count are highly correlated with each other.
Interestingly, discussions with an accepted answer also tend to have more views.
# Identify columns that contain lists, arrays or strings
cols_to_exclude = [
col
for col in df.columns
if df[col].apply(lambda x: isinstance(x, (list, np.ndarray, str))).any()
]
cols_to_exclude.extend(
[
"post_discussion_id",
"post_category_id",
"post_id",
"post_author_id",
]
)
# Create a new DataFrame that only includes columns with single numerical values
df_numerical = df.drop(columns=cols_to_exclude)
# Calculate the correlation matrix
corr = df_numerical.corr()
# Plot the heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(corr, annot=False, cmap="coolwarm")
plt.xticks(rotation=45) # Rotate x-axis labels
plt.tight_layout() # Adjust plot margins
plt.show()
2023 is when interest in OpenAI, and thus its community, really started building up. OpenAI Dev Day in November 2023 led to a huge increase in interest around OpenAI.
It would be interesting to see how many users joined the community each month too!
Volume of Posts Over Time
# Parse timestamps and derive a year-month period for each post
df["post_created_at"] = pd.to_datetime(df["post_created_at"])
df["year_month"] = df["post_created_at"].dt.to_period("M")
# Count sentiment labels by month
colors = {"negative": "red", "neutral": "blue", "positive": "green"}
sentiment_label_counts_by_month = (
df.groupby(["year_month", "post_sentiment"]).size().unstack(fill_value=0)
)
# Calculate proportions of sentiment labels by month
total_posts_per_month = sentiment_label_counts_by_month.sum(axis=1)
sentiment_label_proportions_by_month = sentiment_label_counts_by_month.divide(
total_posts_per_month, axis=0
)
sentiment_label_counts_by_month.plot(
kind="bar",
stacked=True,
figsize=(14, 8),
color=[colors[col] for col in sentiment_label_counts_by_month.columns],
)
plt.title("Volume of Posts Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Posts")
plt.xticks(rotation=45)
plt.legend(title="Sentiment")
plt.tight_layout()
plt.show()
Average Sentiment Over Time
In the same vein, a significantly larger number of people seem to be happy post Dev Day!
df["post_created_at"] = pd.to_datetime(df["post_created_at"])
# Set the 'post_created_at' column as the index
df.set_index("post_created_at", inplace=True)
monthly_sentiment = (
df.resample("ME")["post_sentiment"].value_counts().unstack(fill_value=0)
)
# Plotting
plt.figure(figsize=(15, 8))
# Plot one line per sentiment label, using the shared colour mapping
for sentiment in monthly_sentiment.columns:
    plt.plot(
        monthly_sentiment.index,
        monthly_sentiment[sentiment],
        color=colors[sentiment],
        label=sentiment,
    )
plt.legend(title="Sentiment")
# Formatting the x-axis to show Month-Year
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%b-%Y"))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
# Improve x-axis labels readability
plt.gcf().autofmt_xdate()
plt.title("Average Sentiment Over Time")
plt.xlabel("Time")
plt.ylabel("Number of Posts")
plt.grid(True)
plt.show()
Engagement metrics over Time
Engagement understandably peaked around the two main events of 2023: the launch of GPT-4 and OpenAI Dev Day.
aggregated_data = df.resample("ME", on="post_discussion_created_at").agg(
{
"post_discussion_views": "sum",
"post_discussion_like_count": "sum",
"post_discussion_reply_count": "sum",
}
)
fig, ax1 = plt.subplots(figsize=(15, 7))
color = "tab:red"
ax1.set_xlabel("Time")
ax1.set_ylabel("Views", color=color)
ax1.plot(aggregated_data.index, aggregated_data["post_discussion_views"], color=color)
ax1.tick_params(axis="y", labelcolor=color)
ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis
color = "tab:blue"
ax2.set_ylabel(
"Likes and Replies", color=color
) # we already handled the x-label with ax1
ax2.plot(
aggregated_data.index,
aggregated_data["post_discussion_like_count"],
color="blue",
label="Likes",
)
ax2.plot(
aggregated_data.index,
aggregated_data["post_discussion_reply_count"],
color="green",
label="Replies",
)
ax2.tick_params(axis="y", labelcolor=color)
# Add vertical lines and labels at the GPT-4 launch and OpenAI Dev Day
dev_day = mdates.date2num(
pd.to_datetime("2023-11-06")
) # Convert the date to matplotlib's internal format
gpt4_launch = mdates.date2num(pd.to_datetime("2023-03-14"))
ax2.axvline(dev_day, color="black", linestyle="--") # Add a vertical line
ax2.axvline(gpt4_launch, color="black", linestyle="--") # Add a vertical line
ax2.text(
dev_day,
ax2.get_ylim()[1],
"OpenAI Dev Day",
horizontalalignment="left",
verticalalignment="top",
) # Add a label
ax2.text(
gpt4_launch,
ax2.get_ylim()[1],
"GPT-4 Launch",
horizontalalignment="left",
verticalalignment="top",
) # Add a label
fig.tight_layout() # otherwise the right y-label is slightly clipped
plt.legend(loc="upper left")
plt.show()
df = og_df.copy(deep=True)
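One thing to keep in mind when aggregating: discussion-level columns such as post_discussion_views are repeated on every post row, so summing them over posts counts each discussion once per post. Whether that weighting was intended above is not stated; the sketch below, on made-up toy data, shows one way to count each discussion only once.

```python
import pandas as pd

# Toy data: discussion 1 has three posts, discussion 2 has one.
toy = pd.DataFrame(
    {
        "post_discussion_id": [1, 1, 1, 2],
        "post_discussion_created_at": pd.to_datetime(
            ["2023-03-14", "2023-03-14", "2023-03-14", "2023-03-20"], utc=True
        ),
        "post_discussion_views": [100, 100, 100, 50],
    }
)

# Keep one row per discussion so each discussion's views are counted once.
discussions = toy.drop_duplicates(subset="post_discussion_id")

monthly_views = (
    discussions.set_index("post_discussion_created_at")["post_discussion_views"]
    .resample("MS")  # month-start bins
    .sum()
)
print(monthly_views)
```

Here the de-duplicated total for March 2023 is 150 views, whereas summing over all post rows would give 350.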
Weighted Sentiment Score Over Time
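Let's compute a weighted sentiment score for each topic and plot it over time: each post's sentiment label is mapped to -1, 0 or 1 and multiplied by its confidence score, and the result is averaged per topic and month.
The averaged score is read using the following bands:
- -0.1 to 0.1: Neutral
- -1.0 to -0.1: Negative
- 0.1 to 1.0: Positive
Because the adjusted variant multiplies non-neutral scores by 1.5, an individual post can score as low as -1.5 or as high as 1.5.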
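To make the scoring concrete, here is a small sketch on three made-up posts. It uses a vectorised `np.where` in place of a row-wise `apply`, which should give the same numbers while being faster on a large DataFrame; the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "post_sentiment": ["negative", "neutral", "positive"],
        "post_sentiment_score": [0.8, 0.9, 0.6],
    }
)

sign = toy["post_sentiment"].map({"negative": -1, "neutral": 0, "positive": 1})

# Plain weighted score: sign of the label times the model's confidence.
toy["weighted_sentiment"] = sign * toy["post_sentiment_score"]

# Adjusted variant: non-neutral posts get a 1.5x weight; neutral posts stay at 0.
weight = np.where(sign != 0, toy["post_sentiment_score"] * 1.5, toy["post_sentiment_score"])
toy["weighted_sentiment_adjusted"] = sign * weight

print(toy)
```

Neutral posts always contribute 0 to the average, so a topic with many neutral posts is pulled towards the neutral band regardless of how confident the model was.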
df["post_created_at"] = pd.to_datetime(df["post_created_at"])
# Assign numeric values to sentiment labels.
sentiment_numeric = {"negative": -1, "neutral": 0, "positive": 1}
df["sentiment_numeric"] = df["post_sentiment"].map(sentiment_numeric)
# Calculate weighted sentiment score.
df["weighted_sentiment"] = df["sentiment_numeric"] * df["post_sentiment_score"]
# Group by topic and month, then calculate the average weighted sentiment.
df["year_month"] = df["post_created_at"].dt.to_period("M")
grouped = (
df.groupby(["topic_model_broad", "year_month"])["weighted_sentiment"]
.mean()
.reset_index()
)
# Pivot for easier plotting.
pivot_table = grouped.pivot(
index="year_month", columns="topic_model_broad", values="weighted_sentiment"
)
df["adjusted_weight"] = df.apply(
lambda row: (
row["post_sentiment_score"] * 1.5
if row["sentiment_numeric"] != 0
else row["post_sentiment_score"]
),
axis=1,
)
# Calculate weighted sentiment score using the adjusted weights.
df["weighted_sentiment_adjusted"] = df["sentiment_numeric"] * df["adjusted_weight"]
# Group by topic and month, then calculate the average adjusted weighted sentiment.
df["year_month"] = df["post_created_at"].dt.to_period("M")
grouped_adjusted = (
df.groupby(["topic_model_broad", "year_month"])["weighted_sentiment_adjusted"]
.mean()
.reset_index()
)
# Pivot for easier plotting.
pivot_table_adjusted = grouped_adjusted.pivot(
index="year_month",
columns="topic_model_broad",
values="weighted_sentiment_adjusted",
)
# Plotting
plt.figure(figsize=(14, 8))
pivot_table_adjusted.index = pivot_table_adjusted.index.to_timestamp()
pivot_table_adjusted.drop(["Emoji (8)"], axis=1, inplace=True)
for column in pivot_table_adjusted.columns:
    clean_series_adjusted = pivot_table_adjusted[column].dropna()
    plt.plot(
        clean_series_adjusted.index,
        clean_series_adjusted,
        marker="",
        linewidth=2,
        label=column,
    )
# Add horizontal bands
plt.fill_between(clean_series_adjusted.index, -0.1, 0.1, color="blue", alpha=0.1)
plt.fill_between(clean_series_adjusted.index, 0.1, 1.0, color="green", alpha=0.1)
plt.fill_between(clean_series_adjusted.index, -1.0, -0.1, color="red", alpha=0.1)
plt.title("Average Adjusted Weighted Sentiment Score Over Time by Topic Model")
plt.xlabel("Time")
plt.ylabel("Average Adjusted Weighted Sentiment Score")
plt.legend(title="Topic Model", loc="best")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()