Yeah, nobody knows! (Well, nobody knows until they try it out and see). Maybe the tweets fall naturally into just two clusters. Maybe five? Maybe 10? Maybe no matter what number of clusters you choose, the tweets still don't seem to fall into sensible groupings, because the data simply doesn't cluster very well (although I hope that's not the case!). No matter how many clusters you have, there will always be some outliers that don't really fit well into any cluster, but get assigned to a cluster anyway because that's what the clustering algorithm does (at least, if you use KMeans).
More broadly, this is what academic "research" is like -- tackling open problems, where the answer isn't known ahead of time, and you just have to try things, dig around, and see what you can find. Some of the clusters might have a clear/meaningful theme, but others might not.
Some things to look at:
a) which words/terms are most common in each of the clusters
b) several of the full tweets that were assigned to each cluster
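For part a, here is a rough sketch of one way to count the most common words per cluster. It uses only plain Python, and it assumes you already have the tweet texts and their cluster labels as two parallel lists. The names tweets, labels, and stopwords are made up for this example, and the tiny stopword list is just a placeholder.

```python
from collections import Counter, defaultdict

# Toy stand-ins: in your notebook these would be your tweet texts and the
# cluster labels from your fitted model (both names are hypothetical).
tweets = [
    "the game was great tonight",
    "great game great win",
    "new phone release today",
    "phone battery life is bad",
]
labels = [0, 0, 1, 1]

# A tiny hand-picked stopword list, just for illustration.
stopwords = {"the", "was", "is", "a", "an", "and", "to", "of"}

# One Counter per cluster, filled with the words of that cluster's tweets.
word_counts = defaultdict(Counter)
for text, label in zip(tweets, labels):
    words = [w for w in text.lower().split() if w not in stopwords]
    word_counts[label].update(words)

# Keep the 3 most frequent words for each cluster.
top_words = {label: counts.most_common(3) for label, counts in word_counts.items()}
for label in sorted(top_words):
    print(label, top_words[label])
```

Splitting on whitespace is crude (it keeps punctuation, hashtags, and URLs attached to words), so you would probably want to swap in whatever tokenizing you already used to build your features.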
For part b, let's assume your clustering model variable is named model, similar to this short example I mentioned in the final project spec, and you have a dataframe named df that contains all the tweets.
After you have fit your model (to create the clusters), model.labels_ is a giant array (the same length as df), which tells you which cluster each tweet was assigned to (for KMeans with N clusters, the ids run from 0 to N-1).
You can insert this information back into the dataframe using:
df['cluster_id'] = model.labels_
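A quick sanity check at this point is to see how many tweets landed in each cluster. Here is a minimal sketch with toy data. The labels array is a hand-written stand-in for model.labels_, and the "text" column name is an assumption, so use whatever your tweet column is actually called.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data; "text" is a hypothetical column name.
df = pd.DataFrame({"text": ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]})
labels = np.array([0, 2, 1, 2, 2])  # stands in for model.labels_

# There should be exactly one label per tweet.
assert len(labels) == len(df)
df["cluster_id"] = labels

# Number of tweets in each cluster, ordered by cluster id.
cluster_sizes = df["cluster_id"].value_counts().sort_index()
print(cluster_sizes)
```

If one cluster swallows nearly all of the tweets, or several clusters have only a handful each, that is worth knowing before you start reading individual tweets.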
You can export the whole dataframe to a CSV file (that you can open in Excel and sort by cluster_id), using:
df.to_csv("/content/drive/My Drive/FinalProj/clustered_tweets.csv")
to save it in the FinalProj folder on your Google Drive (this assumes you have already mounted your Drive in Colab).
Or you can save it to the default Colab temporary folder using:
df.to_csv("/content/clustered_tweets.csv")
and then you can download the file from the little file browser built into Colab.
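If you want the file to open already grouped by cluster, you could sort before exporting. This is just a sketch with toy data: it writes to a temporary folder so it runs anywhere, and in Colab you would use one of the paths above instead. The "text" column name is an assumption.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real dataframe; "text" is a hypothetical column name.
df = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "cluster_id": [1, 0, 1],
})

# Hypothetical output location, used here only so the example is self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "clustered_tweets.csv")

# Sort so tweets from the same cluster sit together in the file;
# index=False leaves the dataframe's row-number index out of the CSV.
df.sort_values("cluster_id").to_csv(out_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```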
Of course, you can also sort/filter the dataframe using pandas (instead of Excel). For example:
df[df['cluster_id'] == 2]
would give you a filtered dataframe containing only those tweets that were assigned to cluster 2.
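To skim a few tweets from every cluster at once (rather than filtering one cluster at a time), something like the following might help. It is only a sketch: the toy dataframe, the "text" column name, and the sample size of 2 are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataframe; "text" is a hypothetical column name.
df = pd.DataFrame({
    "text": ["t0", "t1", "t2", "t3", "t4", "t5"],
    "cluster_id": [0, 1, 0, 1, 1, 0],
})

samples = {}
for cluster_id, group in df.groupby("cluster_id"):
    # Take up to 2 random tweets; random_state makes the pick repeatable.
    n = min(2, len(group))
    samples[cluster_id] = group.sample(n=n, random_state=0)
    print(f"--- cluster {cluster_id} ({len(group)} tweets) ---")
    print(samples[cluster_id]["text"].to_string(index=False))
```

Reading a random sample, instead of just the first few rows, gives you a fairer picture of what each cluster actually contains.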