Why web scraping matters

Every day we rely on web scraping without thinking about it.

Search engines like Google have crawled websites since the late 1990s to index pages. Flight comparison websites like Skyscanner (2003) and Kayak (2004) collect flight listings from airline websites. Job websites like Indeed (2004) fetch job listings from company career pages. Real estate portals such as Zillow use public listings from MLS feeds and scraped sources. Even AI models like ChatGPT are trained using publicly available data collected from the internet.

Web scraping is one of the reasons we can search, compare, monitor reviews, see stock prices, check weather data, track hotel ratings, or follow what people are saying on social media.

So scraping itself is not hacking or illegal by default. It depends on how the data is accessed and what someone does with it. In most cases, scraping publicly available data is legal as long as:

  • The data is visible to the public without login or a paywall
  • No security measures are bypassed
  • The scraper does not overload or damage the website
  • The data does not contain private or personal information protected by privacy laws

Several U.S. court rulings support this view.

Key Court Rulings

Van Buren v. United States – U.S. Supreme Court (June 3, 2021)
The U.S. Supreme Court ruled that someone does not violate the Computer Fraud and Abuse Act (CFAA) simply because they misused data they were allowed to access. A person only violates the law when they access information they were not authorized to access in the first place.
Source: supremecourt.gov/opinions/20pdf/19-783_k53l.pdf

hiQ Labs v. LinkedIn – Ninth Circuit Court (2019 and 2022)
hiQ scraped public LinkedIn profiles to provide workforce analytics. LinkedIn tried to block them and claimed scraping violated the CFAA. In 2019, the Ninth Circuit, upholding a preliminary injunction in hiQ’s favor, ruled that scraping data that is publicly visible without a login likely does not violate the CFAA. The Supreme Court then vacated that ruling and sent the case back for reconsideration in light of Van Buren, and the Ninth Circuit reaffirmed its decision on April 18, 2022.
Source: https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf

Meta (Facebook/Instagram) v. Bright Data – US District Court – Northern District of California (January 23, 2024)
Meta sued Bright Data, a company that scraped Facebook and Instagram for public profile data and sold it. The court ruled against Meta, saying that Bright Data did not violate Meta’s terms of service because it was not accessing data as a logged-in user and did not use fake accounts. Meta later dropped the case on February 26, 2024.
Source: https://www.courthousenews.com/federal-judge-rules-against-meta-in-data-scraping-case

eBay v. Bidder’s Edge – US District Court – Northern District of California (May 24, 2000)
Earlier courts sometimes treated heavy bot traffic as a trespass to chattels, that is, unauthorized interference with a site’s servers. eBay won an injunction by arguing that mass requests interfered with its systems, a theory later viewed more cautiously but still a warning against flooding a site.
Source: https://en.wikipedia.org/wiki/EBay_v._Bidder's_Edge

Together, these decisions point to a simple idea: scraping public pages is not treated as hacking, but aggressive, high-volume activity that degrades a service can still create liability.

Do Terms of Service Make Scraping Illegal?

Website Terms of Service (ToS) are a contract between the platform and a user. If a scraper does not create an account or access data while logged in, they are often not considered to be bound by the ToS in a contractual way.

The hiQ case confirmed that simply visiting a public website and collecting data does not violate federal anti-hacking laws, even if the ToS says scraping is not allowed.

However, a platform can still technically block access, send a cease-and-desist letter, or sue for breach of contract if the scraper agreed to the terms by logging in or creating fake accounts.

robots.txt is a long-standing convention that tells crawlers which paths a site would prefer they avoid. It is a coordination signal, not a statute. Ignoring robots.txt is not, by itself, a crime, and there is no general rule that makes it legally binding. That said, ignoring it while also causing strain can be used to argue bad faith in a dispute. Recent academic work describes robots.txt as a widely adopted technical standard rather than a legal rule.
Source: https://www.sciencedirect.com/science/article/abs/pii/S2212473X25000495
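
As a rough illustration of treating robots.txt as guidance, the sketch below uses Python's standard urllib.robotparser module to check paths before fetching them. The robots.txt content, the crawler name example-review-bot, and the example.com URLs are all made up for this example. A real crawler would load the site's live file with set_url() and read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler name used only for this example.
USER_AGENT = "example-review-bot"


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this path something the site prefers we avoid?
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/reviews/123")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Crawl-delay is a non-standard but common hint for seconds between requests.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

A crawler that skips the disallowed path and waits the suggested delay is following the convention as the paragraph above describes it: a coordination signal rather than a legal requirement.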

When a person writes a review on Google, Yelp, Booking, TripAdvisor, or a similar site, they own the copyright in their text from the moment it is created. The platform does not own the text; it holds a license under its terms to display the review. U.S. Copyright Office guidance explains that copyright attaches automatically to original works fixed in a medium, which includes user-authored text.
Source: https://www.copyright.gov/help/faq/faq-general.html

Because reviewers post with the intent of making their words public, collecting public reviews is not the same as copying private files. Still, re-using reviews brings duties. If a third party displays reviews elsewhere, it should identify the original platform and, when public, the author or username, much like a citation. The text should not be altered in a way that changes meaning. If a review is edited or removed at the source, continuing to show an outdated version is misleading and risks both legal and consumer-trust problems. By contrast, facts such as a business name, address, and opening hours are not protected by copyright, while platform-generated features like average ratings and layout are controlled by the platform.

Privacy laws: GDPR, CCPA, and public data

Public visibility does not erase privacy duties. Under the EU’s GDPR, any information that can identify a person, directly or indirectly, counts as personal data. Using public data about identified people requires a lawful basis, such as legitimate interests, and a fair, transparent purpose. GDPR Article 6 lists the lawful bases for processing; legitimate interests can apply to monitoring public feedback or brand mentions, but they must be balanced against the rights of individuals. Aggregation and minimization reduce risk; publishing profiles about individuals or combining data for targeting raises it.

California’s CCPA gives state residents rights to know, access, delete, and opt out of the sale or sharing of their personal information. CCPA focuses on the consumer’s control over personal data rather than banning collection of public facts, but if a business sells or shares personal information collected from public pages, opt-out rights and other duties apply. The Attorney General’s overview explains these rights and obligations.

In practice, scraping public reviews and public social posts for analytics, research, or display with attribution can fit within these frameworks when data about individuals is handled with restraint, when there is a clear, lawful purpose, and when removal requests are honored.

Ethical Scraping

Law and ethics tend to line up on method. Problems usually arise not from the idea of collecting public pages, but from how it is done. Reasonable request rates, caching, back-off on errors, and respect for operational limits help avoid the kind of system impact that drove cases like eBay v. Bidder’s Edge. Identifying your crawler in the User-Agent shows intent to be a good actor. Avoiding fake accounts or logins maintains the bright line between public access and gated access. Reading robots.txt as guidance, not as law, helps coordinate with site operators. When content is displayed, keep it accurate, name the source and author when public, and stop showing it if the source changes in material ways. These habits reduce legal risk and build trust with users who rely on the integrity of the data you surface.

Even when scraping is legal, it should still be done responsibly.

  • Do not send too many requests too quickly
  • Identify your scraper when possible (User-Agent)
  • Do not bypass login systems or CAPTCHAs
  • Respect update requests or removal requests if someone wants their review taken down
  • If you publish scraped data, keep it accurate and linked to the source
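
The request-rate habits in this list can be sketched in a few lines of Python. This is only one possible shape, not a reference implementation. The PoliteFetcher class, the User-Agent string, the two-second interval, and the set of retryable status codes are assumptions chosen for the example. The HTTP call itself is passed in as a transport function so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical identifying User-Agent with a contact URL.
USER_AGENT = "example-review-bot/0.1 (+https://example.com/bot-info)"

# Status codes treated as "slow down and try again" in this sketch.
RETRYABLE = {429, 500, 502, 503, 504}

# A transport takes (url, headers) and returns (status_code, body).
Transport = Callable[[str, Dict[str, str]], Tuple[int, str]]


class PoliteFetcher:
    """Fetch pages with a minimum delay, a cache, and back-off on errors."""

    def __init__(
        self,
        transport: Transport,
        min_interval: float = 2.0,
        max_retries: int = 3,
        base_backoff: float = 1.0,
        sleep: Callable[[float], None] = time.sleep,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._transport = transport
        self._min_interval = min_interval
        self._max_retries = max_retries
        self._base_backoff = base_backoff
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def _wait_for_slot(self) -> None:
        """Sleep until min_interval has passed since the last request."""
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self._min_interval:
                self._sleep(self._min_interval - elapsed)
        self._last_request = self._clock()

    def get(self, url: str) -> str:
        # Serve repeated requests from the cache instead of the site.
        if url in self._cache:
            return self._cache[url]
        status = 0
        for attempt in range(self._max_retries + 1):
            self._wait_for_slot()
            status, body = self._transport(url, {"User-Agent": USER_AGENT})
            if status == 200:
                self._cache[url] = body
                return body
            if status not in RETRYABLE or attempt == self._max_retries:
                break
            # Exponential back-off: 1x, 2x, 4x ... the base delay.
            self._sleep(self._base_backoff * 2 ** attempt)
        raise RuntimeError(f"giving up on {url} (last status {status})")


def stub_transport(url: str, headers: Dict[str, str]) -> Tuple[int, str]:
    """Stand-in for a real HTTP client, used only for this demo."""
    return 200, f"<html>page for {url}</html>"


fetcher = PoliteFetcher(stub_transport)
page = fetcher.get("https://example.com/reviews")
```

In real use the transport would wrap an HTTP client of your choice. Keeping it injectable also makes the pacing logic easy to test with a fake clock.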

Scraping Listings, Reviews, and Social Media – What Is Allowed?

Generally allowed if:

  • Data is public (hotel listings, product reviews, Instagram public posts)
  • No login, no fake accounts, no password bypassing
  • Original text is not modified
  • Source is credited when redistributed or displayed

Risky or likely illegal if:

  • Private data, contact info, or user profiles behind a login are scraped
  • Fake accounts or bots are used to log in and gather data
  • CAPTCHAs, rate limits, or paywalls are bypassed
  • Data is copied and published to replace the original source

Zembra’s Approach to Responsible and Compliant Data Collection

Zembra only collects information that is publicly visible on the web and does not bypass logins, paywalls, or technical barriers to gain access to data. It does not use fake accounts or simulate user behavior to reach content that is meant to be private. The goal is to work strictly within what is already openly accessible, in the same way a regular user or search engine could view it.

Reviews and social content remain the property of the original authors. When Zembra processes this content, it preserves the original wording and keeps a link to the source platform and author when publicly available. The text is not altered, rephrased, or filtered in a way that changes its meaning. If a review is updated or removed on the original platform, Zembra removes or updates its copy to avoid showing outdated or misleading information.
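
To make the update-and-removal idea concrete, here is a minimal Python sketch of reconciling stored reviews against a fresh snapshot of the source. It is not a description of Zembra's actual system. The Review record, its field names, and the reconcile function are hypothetical, and the sketch assumes the live snapshot is complete, so that a missing review really was removed rather than lost to a failed fetch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Review:
    """Hypothetical stored review with the fields needed for attribution."""
    review_id: str
    platform: str           # source platform, kept for attribution
    author: Optional[str]   # public username, or None if not public
    source_url: str         # link back to the original
    text: str               # original wording, stored unmodified


def reconcile(
    stored: Dict[str, Review], live: List[Review]
) -> Tuple[Dict[str, Review], List[str], List[str]]:
    """Bring the stored copy in line with the live source.

    Returns the new stored state plus the ids that were removed at the
    source and the ids whose content changed. Assumes `live` is a
    complete snapshot of the reviews currently visible at the source.
    """
    live_by_id = {review.review_id: review for review in live}
    removed = sorted(rid for rid in stored if rid not in live_by_id)
    updated = sorted(
        rid
        for rid, review in live_by_id.items()
        if rid in stored and stored[rid] != review
    )
    return live_by_id, removed, updated


# Example data, invented for illustration.
stored = {
    "r1": Review("r1", "ExamplePlatform", "ann", "https://example.com/r1", "Great stay"),
    "r2": Review("r2", "ExamplePlatform", None, "https://example.com/r2", "Too noisy"),
}
live = [
    Review("r1", "ExamplePlatform", "ann", "https://example.com/r1", "Great stay, would return"),
    Review("r3", "ExamplePlatform", "bo", "https://example.com/r3", "Friendly staff"),
]

current, removed, updated = reconcile(stored, live)
print(removed, updated)
```

Here the edited review is replaced with its current wording and the deleted one disappears from the stored copy, so nothing outdated stays on display.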

Zembra takes privacy rules seriously. If any personal data, such as an email address, phone number, full name, or other form of PII, appears within scraped content, it is automatically detected and removed from storage and output. When possible, and especially if the PII appears in a location where it likely should not be public, Zembra alerts the platform or source so the data can be reviewed or taken down at the origin.
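
For a sense of what automated PII removal can look like at its simplest, the Python sketch below redacts email addresses and phone-like numbers with regular expressions. It is an illustration only and does not represent Zembra's detection logic. The patterns, the redact_pii function, the placeholder strings, and the nine-digit threshold are assumptions made for this example. Real detection of names, addresses, and national phone formats needs much broader tooling.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")

# Assumption: digit runs shorter than this (for example ISO dates,
# which have 8 digits) are left alone to limit false positives.
MIN_PHONE_DIGITS = 9


def _redact_phone(match: "re.Match[str]") -> str:
    """Replace a phone-like match only if it contains enough digits."""
    candidate = match.group()
    digits = sum(ch.isdigit() for ch in candidate)
    return "[phone removed]" if digits >= MIN_PHONE_DIGITS else candidate


def redact_pii(text: str) -> str:
    """Return text with emails and phone-like numbers replaced."""
    text = EMAIL_RE.sub("[email removed]", text)
    text = PHONE_RE.sub(_redact_phone, text)
    return text


sample = "Call me at +1 (555) 010-9999 or mail jo@example.com."
print(redact_pii(sample))
```

The digit threshold is a trade-off: it keeps dates intact but also lets short local numbers through, which is why pattern matching alone is rarely enough in production.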

To remain compliant with GDPR, CCPA, and similar laws, Zembra processes only what is necessary, does not sell personal data, and honors valid requests for removal from businesses, platforms, or original reviewers. Collection rates are monitored and kept within reasonable limits to avoid affecting website performance, and caching is used where possible to reduce repeated access to the same content.

The purpose is not to republish or replace the original source of data. It is to make publicly available information easier for businesses and platforms to monitor, analyze, and act on without breaching privacy, copyright, or the rights of original authors.

Conclusion

Scraping is part of how the web works. Courts have confirmed that collecting public pages is not the same as breaking into a private system, and that contract terms do not bind people who never agreed to them simply by visiting public pages. At the same time, copyright, privacy rules, and operational limits matter. If you collect public listings, reviews, or social posts, keep the text faithful to the original, cite the source, avoid heavy traffic, and handle personal data with care under GDPR and CCPA. That blend of lawful access and responsible use is what lets the wider internet benefit from public data without harming the sites and people who publish it.

  • Van Buren v. United States, U.S. Supreme Court opinion, June 3, 2021 (PDF). Supreme Court
  • hiQ Labs, Inc. v. LinkedIn Corp., Ninth Circuit opinion, April 18, 2022 (PDF). Ninth Circuit Court of Appeals
  • Federal judge rules against Meta in data-scraping case (report on Bright Data ruling), Jan 23, 2024. courthousenews.com
  • Meta v. Bright Data filings and order excerpts (document repository). digitalcommons.law.scu.edu
  • eBay v. Bidder’s Edge background summary (trespass-to-chattels theory). Wikipedia
  • eBay v. Bidder’s Edge original district court order (N.D. Cal. 2000). Justia Law
  • Academic discussion: robots.txt as a technical standard, not a legal rule. ScienceDirect
  • GDPR Article 6 lawful bases for processing (legitimate interests and others). GDPR
  • UK ICO guide on the legitimate interests basis (plain-language guidance). ICO
  • California Attorney General overview of CCPA consumer rights and duties. California DOJ
  • U.S. Copyright Office, Copyright FAQ (automatic protection of original text). U.S. Copyright Office

Other Resources

“Is web scraping legal? Yes, if you know the rules.” – Apify Blog, May 26 2025.
This article lays out the fundamentals of scraping public data in the US, EU and UK, and explains when it becomes risky. Apify Blog

“Is Web Scraping Legal? Quick Answer” – ScrapingBee Blog, Oct 3 2025.
A concise, up-to-date breakdown of myths, legal questions and jurisdictional issues around web scraping. ScrapingBee

“Scraping public data. Is it legal?” – WebScraper.io Blog, March 3 2022.
Focuses on scraping public data and privacy law interplay (such as GDPR) with practical examples. WebScraper

“The Legal Landscape of Web Scraping” – Quinn Emanuel Publication, April 28 2023.
Legal-firm analysis of the risks, business uses and how companies rely on scraping. Quinn Emanuel

“Web Scraping Laws” – TermsFeed Blog, 8 months ago.
Emphasizes PII (personally identifiable information) risks when scraping, and helps with privacy law references. TermsFeed