OpenAI, the research organization behind the groundbreaking GPT-3 language model, has announced the launch of a new web crawler named GPTBot, which aims to collect publicly available data from the internet for training AI models. The launch comes amidst recent controversies where tech companies were accused of scraping websites without explicit consent to power large language models like GPT-4.
Web scraping is the automated extraction of data from websites. It is used for applications such as price comparison, data analysis, archiving, and research. However, it can also raise ethical and legal issues, such as violating terms of service, infringing intellectual property rights, and invading privacy.
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
OpenAI claims that GPTBot is different from other web crawlers, as it respects the wishes of webmasters and properly identifies itself. The bot uses the user agent token “GPTBot” and a full user-agent string clearly stating it is from OpenAI. Webmasters can allow or disallow access to GPTBot by using the robots.txt file on their websites.
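Because the crawler announces itself with the "GPTBot" token, a site operator could spot its requests in server logs or request headers. The following is a minimal sketch in Python, not an official OpenAI tool. The function name `is_gptbot` is hypothetical, and the check only looks for the token in the User-Agent header, which any client can spoof.

```python
import re

# Full user-agent string as published for GPTBot.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# Match the "GPTBot" token as a whole word, ignoring case.
_GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token.

    Hypothetical helper: it trusts the header as sent and does not
    verify that the request really comes from OpenAI.
    """
    return _GPTBOT_TOKEN.search(user_agent) is not None


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_UA))
    print(is_gptbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A check like this is only suitable for logging or analytics. Access control is better expressed through robots.txt, which the crawler is said to honor.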
According to OpenAI, GPTBot's main purpose is to gather data for improving future AI models and enhancing their general capabilities and safety. The bot only crawls publicly accessible web pages and skips those that require paywall access, gather personal information, or contain content that violates OpenAI's policies. OpenAI also states that it does not train its models on inputs and outputs sent through its API.
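To block GPTBot from an entire site, add the following to robots.txt:

User-agent: GPTBot
Disallow: /

To give GPTBot access to only part of a site, list the allowed and disallowed directories:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/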
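Before publishing robots.txt rules like those above, a webmaster may want to check how a standards-following parser reads them. The sketch below uses Python's standard `urllib.robotparser` module. The helper name `gptbot_can_fetch` and the `example.com` URLs are assumptions for illustration, and the result shows how Python's parser interprets the rules, not necessarily how GPTBot itself does.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from the article, written one directive per line.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL_ACCESS = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch url.

    Hypothetical helper built on the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for path in ("/directory-1/page.html", "/directory-2/page.html"):
        url = "https://example.com" + path
        print(path, gptbot_can_fetch(PARTIAL_ACCESS, url))
    print("/", gptbot_can_fetch(BLOCK_ALL, "https://example.com/"))
```

With the partial-access rules, the parser allows `/directory-1/` and rejects `/directory-2/`. With the block-all rules, every path is rejected for GPTBot, while other user agents remain unaffected.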
GPTBot is part of OpenAI's efforts to create more transparent and ethical AI systems, following its previous initiatives such as ChatGPT, a conversational model that interacts in a human-like way, and InstructGPT, a model that follows instructions in a prompt and provides detailed responses. Both models were trained using reinforcement learning from human feedback.
OpenAI hopes that GPTBot will help AI models become more accurate and useful, while also respecting the rights and preferences of website owners and users. The organization invites feedback and suggestions from the public on how to improve GPTBot and its policies.