How to block OpenAI’s new AI-training web crawler from ingesting your data

A man is seen using the OpenAI ChatGPT artificial intelligence chat website in this illustration photo on 18 July, 2023. (Photo by Jaap Arriens/NurPhoto via Getty Images)


ChatGPT creator OpenAI has launched a new web crawler — known as GPTBot — together with instructions on how to block it. 

ChatGPT is one of the most successful AI systems ever built, despite recent reports of its wavering intelligence. OpenAI, the company behind the AI chatbot, continues to train its large language models (LLMs), such as GPT-3.5 and GPT-4.


Web crawlers, used by search engines like Google and Bing to scan websites and index content, are also used by AI companies to train LLMs. These models learn from the content of websites and any other data their developers choose to train them on. Using a web crawler speeds up this process by letting the LLMs train on vast amounts of data.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI notes in its GPTBot documentation. The company says it is filtering out web pages that require paywall access, gather personally identifiable information, or contain text that violates OpenAI’s policies.

Developers have the option of blocking GPTBot from accessing their websites and using their data to train AI systems.

OpenAI explains how to disallow or customise GPTBot access to your website.

Screenshot: OpenAI | Image Composition: Maria Diaz/ZDNET

To block GPTBot from accessing a website altogether, the site owner can add the GPTBot user-agent token to the site’s robots.txt, followed by “Disallow: /”.
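
As a rough illustration of that rule, the sketch below feeds the two robots.txt lines to Python's standard-library robots.txt parser and checks what they permit. The `is_allowed` helper and the example.com URLs are placeholders invented for this example, and Python's parser is only one interpretation of the rules; it does not show how GPTBot itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules described above: name the GPTBot token,
# then disallow the whole site for that crawler.
BLOCK_ALL_RULES = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real deployment.
    page = "https://example.com/any/page.html"
    print("GPTBot allowed:", is_allowed(BLOCK_ALL_RULES, "GPTBot", page))
    print("Other bot allowed:", is_allowed(BLOCK_ALL_RULES, "SomeOtherBot", page))
```

Because the rule names only GPTBot, other crawlers are unaffected by it. Note that robots.txt is a request to crawlers rather than an enforcement mechanism, so it works only to the extent that the crawler honours it.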

OpenAI also lets users customise GPTBot’s access by letting it crawl only certain parts of their website. To do so, add GPTBot to the site’s robots.txt along with “Allow: /directory-1/” and “Disallow: /directory-2/”, customised as needed.
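
The partial-access rules can be checked the same way. This is a minimal sketch using Python's standard-library parser; `gptbot_can_fetch`, the example.com domain, and the page names are assumptions made for illustration, while the two directory names come from the rules quoted above.

```python
from urllib.robotparser import RobotFileParser

# Partial access: GPTBot may crawl /directory-1/ but not /directory-2/.
PARTIAL_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_can_fetch(rules: str, path: str) -> bool:
    """Return True if GPTBot may fetch the given path under these rules.

    Hypothetical helper; example.com stands in for the real site.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("GPTBot", "https://example.com" + path)


if __name__ == "__main__":
    for path in ("/directory-1/page.html", "/directory-2/page.html"):
        print(path, "->", gptbot_can_fetch(PARTIAL_RULES, path))
```

Replacing the two directory names with a site's real paths is the customisation the instructions refer to.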


OpenAI had not previously announced the use of web crawlers to train GPT-3.5, the LLM behind the free version of ChatGPT, or GPT-4, its latest LLM, which is available to ChatGPT Plus subscribers and powers Bing AI.

Though it is unclear whether GPTBot was used to train OpenAI’s currently available LLMs, it could be the web crawler training GPT-5, particularly as the company filed to trademark the name in July. While OpenAI has not announced a release date for GPT-5, the new LLM is expected to be more powerful and larger than GPT-4, which is currently the largest LLM available.


Since the launch of ChatGPT, OpenAI has been hit with several lawsuits alleging that the AI tool is stealing data from users, including a copyright infringement case that made the company the target of an FTC investigation. Websites like Stack Overflow, Reddit, and Twitter have said they plan to start charging AI companies to access their data.

Copyright for syndicated content belongs to the linked source: ZDNet – https://www.zdnet.com/article/how-to-block-openais-new-ai-training-web-crawler-from-ingesting-your-data/#ftag=RSSbaffb68
