Scribd – A Goldmine of Sensitive Data

Ever wondered what could possibly go wrong when a digital document library allows users to upload whatever documents they want, letting them believe it is similar to Google Drive? Let’s just call it a recipe for disaster.

When Edward A. Murphy Jr. mentioned in his famous “Murphy’s Law” that whatever can go wrong will go wrong, he was right, maybe because he knew that people wouldn’t think twice before sharing their PII over the internet without clearly understanding its consequences.

Anyway, enough of the story-building, let’s dig deeper and see what happened and what we found out.

Introduction

Before diving straight into the technical aspects or the numbers, it’s important to know what Scribd actually is.

Scribd is a digital library housing over 195 million documents from users worldwide. These include ebooks, audiobooks, magazines, sheet music, legal documents, and PII documents, tsk tsk.

The service has an “Upload to Download” freemium plan, where users can upload 5 documents to download one document they want. Well, exchanging bank statements for a favorite magazine sounds pretty good, right? Right!?

So yeah, back to the story. One fine day we thought, hmmm, maybe there should be something more than those boring ebooks on this platform. We were surely not surprised after seeing all the documents that were returned after we searched for “passport”. Our first reaction was that there must be way more than just passports if people didn’t think twice about these.

Every subsequent query we ran on the platform returned tons of interesting and, at the same time, sensitive results. At last we realized, yeah, that’s it, we need to publish some research about this. People need to know what they are exposing over the web, and if this blog also gets a message across to the platform itself, that would come as a big relief to everyone!

Scraping Out the Data

We wanted some numbers to show the impact of these leaks, and just staring at the screen and counting the entries would definitely have been the dumbest idea. We had to scrape the platform.

First, we decided on the criteria our queries would be based on. These were:

Documents must have been posted/uploaded on the platform in the last year
Documents must have between 1 and 3 pages, since this is the most likely length of files that would host PII (this is a tradeoff for cases like bank statements, which can run longer)
Document names/URLs should contain the name/category of PII we are looking for (e.g. https://www.scribd.com/document/12345/passport-pdf)
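
To make the three criteria above concrete, here is a minimal sketch of them as a filter in Python. This is not the tooling we used (that was cURL and jq); the Doc record and its field names are hypothetical, since the actual fields of the search response are not shown in this post, and "the last year" is assumed to mean the previous 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from urllib.parse import urlparse


@dataclass
class Doc:
    # Hypothetical record shape, for illustration only.
    url: str
    page_count: int
    uploaded_on: date


def matches_criteria(doc: Doc, keyword: str, today: date) -> bool:
    """Return True only if the document satisfies all three criteria."""
    # 1. Uploaded within the last year (assumed: previous 365 days).
    recent = today - timedelta(days=365) <= doc.uploaded_on <= today
    # 2. Between 1 and 3 pages, inclusive.
    short = 1 <= doc.page_count <= 3
    # 3. The keyword appears in the last path segment of the URL.
    slug = urlparse(doc.url).path.rstrip("/").rsplit("/", 1)[-1].lower()
    named = keyword.lower() in slug
    return recent and short and named
```

For example, a 2-page document at https://www.scribd.com/document/12345/passport-pdf uploaded a few months before today passes for the keyword "passport", while a 10-page one at the same URL does not.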

Now that we had the criteria, the challenge was fetching the documents from the platform. How do we do this? Unaware of any public APIs offered by the platform, we looked everywhere for a good way to achieve what we were trying to do, but to no avail.

Until... one day, while searching for documents again out of curiosity, we opened up Inspect Element and voila! We discovered an API endpoint that allowed us to retrieve the document URLs and information programmatically.

Screenshot showing the API endpoint we discovered for retrieving search results

The endpoint returned a maximum of 42 results per page, and the pages could be iterated over. So now, all it took to get to the data was firing off cURL requests.

Screenshot showing the cURL request we used for retrieving Passports

In the above one-liner, you can see that we iterated from page 1 to 12000 (taking that to be the highest available page number, though the results most likely ran out by the 200th or 400th page). Once we had the results from the cURL request, we used jq to parse the JSON response body, retrieve the document URLs, and append them to a file named urls.txt. We repeated this process for several keywords/queries that we believed would return something sensitive.
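
The parsing and de-duplication step can also be sketched in Python. This is only an illustration and makes no network requests: the "results" and "url" keys are placeholder assumptions rather than the real response format, and stopping at the first empty page is a variation on our approach, which used a fixed upper bound of 12000.

```python
import json
from typing import Iterable


def extract_urls(response_body: str) -> list[str]:
    """Pull the document URLs out of one JSON page of search results."""
    payload = json.loads(response_body)
    # ASSUMPTION: "results" and "url" are placeholder keys, not the real schema.
    return [item["url"] for item in payload.get("results", []) if "url" in item]


def collect_unique(pages: Iterable[str]) -> list[str]:
    """Merge pages in order, dropping duplicate URLs.

    Stops at the first page with no results instead of walking
    every page number up to a fixed bound.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for body in pages:
        urls = extract_urls(body)
        if not urls:
            break
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique
```

Keeping a set of already-seen URLs is what makes a count of unique documents meaningful, since the same document can show up under more than one page or keyword.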

The Results

Enough of the nerd stuff, let’s talk about numbers, and the results we retrieved from these scans.

Infographic showing the results uncovered in the research

We managed to retrieve a total of 13,197 unique PII documents across 15 different categories, affecting users worldwide. What is more concerning is that our categories were still quite limited and our sample only covered the last year. Consider this to be the tip of the iceberg!

The document type we encountered most frequently was offer letters! These documents contain not only the bags of cash dudes will be printing but also their names, addresses, emails, and whatnot. But this wasn’t all! There was way more stuff on this platform than just offer letters. We found bank statements holding financial transactions, bill invoices with addresses, PDF exports of WhatsApp chats, vaccine certificates, 2FA backup codes, and visa documents. The list was, and is, never-ending!

But seriously, 2FA backup codes being uploaded to a digital library? This suggests that the users involved don’t understand what they are doing. Imagine a government employee uploading the 2FA backup codes for their mail account here. It could be catastrophic.

While uncovering this disaster, we also came across a job scam being hosted on Scribd in the form of offer letters:

Image showing a potential job scam being run over Scribd

In the above image, you will notice how similar the designs are across all of the offer letters, even though they pose as different institutions/entities. This would be an interesting case to research further, and if you ever decide to work on it, be sure to let us know :)

The Solution?

From a user’s PoV, we need to be more cautious about what we upload to such platforms and, if we do upload, check whether they have access controls in place to restrict unwanted access to sensitive information such as this.

For Scribd and similar platforms? The below image summarises what I wish for.

Conclusion

So that’s it, folks! I hope that we were able to produce something interesting and insightful enough. This research is one of our ways of raising awareness about security and, of course, the implications of posting PII over the internet.

If you have any suggestions/feedback/questions or just anything, feel free to reach out through my email [email protected] or LinkedIn.

See you soon with some new research or tools as always, until then, happy hacking!
