Full podcast: youtu.be/i1NlQixGW2I
Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.
Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post, shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts and following hundreds of replies from people outraged that their posts were scraped without their permission, van Strien took it down and apologized.
This is a production of 404 Media, a journalist-owned tech website. Learn more and subscribe at: https://404media.co
Listen to our weekly podcasts:
Apple Podcasts: https://podcasts.apple.com/us/podcast/the-404-media-podcast/id1703615331?ref=404media.co
Spotify: https://open.spotify.com/show/0F3oY47l2XgoBMaAmIaw29?ref=404media.co
Google Podcasts: https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5hY2FzdC5jb20vcHVibGljL3Nob3dzL3RoZS00MDQtbWVkaWEtcG9kY2FzdA?ref=404media.co
Become a paid subscriber for access to bonus content: https://404media.co/membership