What do AI chatbots know about us, and who are they sharing it with?

Use these programs with caution, just like any other application.


AI chatbots are relatively old by tech standards, but the latest crop—led by OpenAI's ChatGPT and Google's Bard—are far more capable than their forebears, and not always for positive reasons. The recent explosion in AI development has already raised concerns about misinformation, disinformation, plagiarism and machine-generated malware. What problems might generative AI pose for the privacy of the average internet user? The answer, experts say, largely comes down to how these bots are trained and how much we plan to interact with them.

In order to replicate human-like interactions, AI chatbots are trained on large amounts of data, much of which comes from repositories like Common Crawl. As the name suggests, Common Crawl has amassed petabytes of data over years of crawling and scraping the open web. "These models are trained on large datasets of publicly available data on the internet," said Megha Srivastava, a PhD student in Stanford's Department of Computer Science and a former AI resident at Microsoft Research. Although ChatGPT and Bard use what they call a "filtered" portion of Common Crawl's data, the sheer size of the model "makes it impossible for anyone to look at the data and sanitize it," according to Srivastava.


Whether through your own carelessness or the poor security practices of a third party, your personal information could be sitting in some remote corner of the internet right now. While it may be difficult for the average user to find, it's possible that information was scraped into a training set and could be regurgitated by a chatbot. And a bot spitting out someone's real contact information is by no means a theoretical problem: Bloomberg columnist Dave Lee posted on Twitter that when someone asked ChatGPT to chat on the encrypted messaging platform Signal, it shared his exact phone number. This kind of interaction is probably an edge case, but the information these learning models have access to is still worth considering. "It's unlikely that OpenAI would want to collect specific information, such as health data, and attribute it to individuals in order to train its models," David Hoelzer, a fellow at the security organization SANS Institute, told Engadget. "But could it be in there inadvertently? Absolutely."


OpenAI, the company behind ChatGPT, did not respond when we asked what measures it takes to protect data privacy or how it handles personally identifiable information that may be swept into its training sets. So we did the next best thing and asked ChatGPT itself. It told us that it is "programmed to follow ethical and legal standards that protect users' privacy and personal information" and that it "doesn't have access to personal information unless it is provided to me." Google, for its part, told Engadget that it has programmed similar guardrails into Bard to prevent the sharing of personally identifiable information during conversations.


ChatGPT's answer points to the second major vector through which generative AI can pose a privacy risk: use of the software itself, whether through information shared directly in chat logs or through device and user information captured by the service during use. OpenAI's privacy policy cites several categories of standard information it collects about users that could be identifiable, and ChatGPT warns at launch that conversations may be reviewed by its AI trainers to improve its systems.


Meanwhile, Google's Bard doesn't have a separate privacy policy, instead relying on the general privacy document shared by other Google products (which happens to be extremely broad). Conversations with Bard don't have to be saved to a user's Google account, and users can delete them via Google, Engadget reported. "In order to build and maintain user trust, these companies will need to be very transparent about their privacy policies and data protection practices on the front end," Rishi Jaitly, a professor and distinguished humanities fellow at Virginia Tech, told Engadget.


ChatGPT does offer a "clear conversations" action, but pressing it won't actually delete your data, according to the service's FAQ page, nor can OpenAI delete specific prompts. While the company discourages users from sharing anything sensitive, the only way to remove personally identifiable information provided to ChatGPT appears to be deleting your account, which the company says will permanently delete all associated data.


Hoelzer told Engadget that he isn't worried about ChatGPT ingesting individual conversations in order to learn. But that conversation data is stored somewhere, so its security becomes a reasonable concern. Notably, ChatGPT was briefly taken offline in March because a programming bug exposed information about users' chat histories. This early in these tools' widespread deployment, it's unclear whether chat logs from these kinds of AIs will become valuable targets for malicious actors.


For the foreseeable future, it's best to treat these kinds of chatbots with the same suspicion you would bring to any other tech product. "A user playing with these models should go in with the expectation that any interaction they have with the model," Srivastava told Engadget, "is fair game for OpenAI or any of these other companies to use to their advantage."
