🌐
Videos Blog About Series πŸ—ΊοΈ
❓
πŸ”‘

The war on scraping is lost for the same reason as the war on piracy πŸ”—
1635962678  

🏷️ blog

A great deal of effort is expended upon anti-scraping measures in webpages. There are a number of reasons for this:

  • prohibitive bandwidth costs involved in allowing bulk downloads
  • wanting tight control over how users view the data in order to influence their conclusions
  • competitors stealing content and reproducing their service in a jurisdiction in which they have no legal recourse.
The first concern is best addressed by rate-limiting mechanisms or metering fees. The second concern has become quite a heated topic for social media of late. This is not a problem for most services, as it's not good business to second guess paying customers. On the other hand, if the business is advertising (as it is with social media) influence is precisely the point.

For most businesses the concern will be the last one. I've learned over my career programming that data oriented design not only results in faster code, but less code. It's entirely possible to build a successful business with entirely open source code but proprietary data using this model. That said, it makes one uniquely vulnerable to such data theft.

Can we fix it?

Enter anti-scraping technology. For a good overview of the current landscape, see here. You may have noticed the core problem is "fingerprinting", which is essentially the same one to solve with software licensing. This is because it's the same exact problem as software piracy, programs are just data that transform other data.

Those of you familiar with implementing software licensing schemes like I have are well aware that basically everything but phone-homes coupled with fingerprinting are not worth pursuing. Even then, there is no real way to prevent people from nopping your checks out. Generally you see mechanisms to ensure that a crack for one version does not work on the next. This has resulted in a status quo of customers submitting to this stick with the carrot of forward-going code updates.

Which is to say a stalemate in the immediate term, but total surrender in the long term. The only reliable way to prevent this is to never allow clients to interpret your code. Even then side-channel attacks are possible to reverse-engineer it.

This model breaks down for targeted and simple programs, as after some point there's nothing to update. I suspect this is much of the reason that observation of Zawinski's law is so prevalent in the software industry. There is however no such concern with data, as you can always add more. The video game industry in particular has embraced this with zeal. Expansion content not only drives much of sales, it also works quite well to keep their content artisans fully employed when they might otherwise have downtime.

You may have noticed that the ultimate remedy available to software is not exactly feasible for data. Data cannot be fully obscured from the client in nearly all use cases. Anti-scraping measures (as you can see from the overview) have also failed almost comprehensively. This has had far-reaching effects on a number of industries.

Tech Blogging has been totally smothered by plagiarists who know how to do SEO. The only real reason to do one nowadays is as a big "hire me" billboard. My father was an inventor with a number of patents, and he discovered (the hard way) that they were also useless besides as an inducement for employment. Almost every social media platform which started out with good APIs have now comprehensively crippled or dropped them altogether and an industry of scraping based tools have popped up to satisfy this need. Plaid became a multibillion dollar company by doing scraping of bank websites using bank clients own logins.

Like with software licensing it begs the question of why any of this effort is expended at all, given it's ultimately Canute screaming at the tides. This comes down to legal reasons. The courts generally say that you "had it coming" if you left a gold bar in the middle of the street and it gets stolen. So it is with software and data. If you don't at least make a token effort at anti-circumvention you have no recourse. Of course, this is not consistently applied to all firms and jurisdictions but such is the law. If we wanted consistent outcomes, we'd replace black robes and powdered wigs with programs. Even then this has no bearing internationally, as most firms ability to have recourse there is nil.

Time to get creative

The good news is that it turns out that any effort beyond token prevention in fact hinders your ability to stop piracy. Pirates are inherently lazy, and you can exploit this to get a handle on the problem. For example, I once worked with an IP based licensing scheme that also gathered OS fingerprints passively, but did not do enforcement based on the latter. This allowed some people to feel they were quite clever running a number of instances behind NAT. Periodically they get rounded up (random reinforcement works best for operant conditioning) and told they're gonna get a lifetime ban unless they buy the right number of licenses and sign an NDA about the incident. This was but one example of many where over the years where laying traps for pirates paid off quite handsomely.

Just like with my previous article about the victory of spam, the proper mindset is not to fight but "make the trend your friend". The motivations for piracy and spamming are both deeply ingrained in human nature. The most powerful people and organizations in the world have fought that war against our baser natures for millennia and are still no closer to victory then when they set out. This time will not be different.

25 most recent posts older than 1635962678
Size:
Jump to:
POTZREBIE
© 2020-2023 Troglodyne LLC