Research involving internet data

Started by stupefied, June 21, 2019, 06:38:54 AM

stupefied

(I posted this on the CHE fora, but found out the CHE fora will end very soon.)  I thought I would repeat the post here.

If information is publicly available on the internet, does that mean we can freely mine the data from a website for research purposes? The unit of analysis is not an individual. It will be text analysis, supplemented with quantitative analysis. The authors of the text are truly anonymous to viewers, with no names identified. There are not even usernames or pseudonyms. The website is public and unmoderated; viewers can access the posts even without being a member.

polly_mer

Welcome!

I'll repeat my advice here for readers:

1) Check with your IRB because there are still humans somewhere who wrote the text.

2) Check with your colleagues on whether what you propose will count as publishable research.  A convenience sample is increasingly frowned upon for scientific research because it lacks an adequate control group and is strongly biased by who participates in internet forums.
Quote from: hmaria1609 on June 27, 2019, 07:07:43 PM
Do whatever you want--I'm just the background dancer in your show!

Puget

This pretty clearly doesn't meet the Common Rule definition of human subject, because you are not interacting with the individuals nor obtaining identifiable private information, so I don't think there is any role for the IRB here:

Human Subject
A living individual about whom an investigator (whether professional or student) conducting research obtains data through intervention or interaction with the individual, or identifiable private information. 45 CFR 46.102(f)

Whether the research methods are solid is a different question and depends entirely on your research questions, specific methods, and what populations you are trying to generalize to.
"Never get separated from your lunch. Never get separated from your friends. Never climb up anything you can't climb down."
–Best Colorado Peak Hikes

stupefied

It is not the IRB issues that concern me. Rather, I had assumed that data that were publicly available--particularly on unmoderated sites--were available for use. The literature even indicates that other scholars have mined data from the same website. I didn't even bother to read the terms of agreement. Problem is, I have already collected and analyzed the data, and even received a book contract. Am I screwed? (I did attempt to contact the site, but there is no direct email or phone number.)

polly_mer

Quote from: Puget on June 21, 2019, 07:22:47 AM
This pretty clearly doesn't meet the Common Rule definition of human subject, because you are not interacting with the individuals nor obtaining identifiable private information, so I don't think there is any role for the IRB here:

Human Subject
A living individual about whom an investigator (whether professional or student) conducting research obtains data through intervention or interaction with the individual, or identifiable private information. 45 CFR 46.102(f)

Some IRBs still want a formal piece of documentation BEFORE ANY research begins to ensure that all research has at least one other person with no stake in the research confirm that this research is truly exempt and outside the purview.  After an institution gets a couple of nasty surprises, they often become much more strict about bumpers before the actual violations can occur.

Why are those IRBs so leery?  Well, they've heard too many stories that go:

I've already done the research and I'm ready to publish.  Now, I have some questions about the rules since I assumed there were no problems and now people are asking some questions for which I don't have the answers.

Puget

Quote from: polly_mer on June 21, 2019, 11:57:08 AM
Some IRBs still want a formal piece of documentation BEFORE ANY research begins to ensure that all research has at least one other person with no stake in the research confirm that this research is truly exempt and outside the purview.  After an institution gets a couple of nasty surprises, they often become much more strict about bumpers before the actual violations can occur.

Why are those IRBs so leery?  Well, they've heard too many stories that go:

I've already done the research and I'm ready to publish.  Now, I have some questions about the rules since I assumed there were no problems and now people are asking some questions for which I don't have the answers.

Yes of course you should follow university policy, but the Common Rule is pretty unambiguous here. It's not even that this is an exempt category, which may* require a determination of exempt status through limited IRB review-- it's quite simply not human subjects research according to the Common Rule definition.

I do understand the CYA impulse on the part of some universities, but it's really a waste of limited IRB time and resources (which leaves those of us who do real human subjects research waiting for review).

* The 2018 revised Common Rule regulations were meant to move IRBs toward allowing self-determination of exempt status unless sensitive identifiable data were involved. Implementation of this has been uneven (see CYA impulse).

polly_mer

CYA has been very strong everywhere I've been.  The IRB may not be the best mechanism for all research-related problems, but it is often the mechanism chosen.  Another consideration is the federal regulations only cover federally funded research, but the CYA aspect means the local IRB often is tasked with all research.

Pigou on the other thread on the other website points out the legalities of scraping data from a website.  Again, that type of legality may not be the official purview of the local IRB, but CYA for the local professor can often best be done by sending a proposal through whatever mechanisms exist to get someone to sign off on marginal activities.

I think of the problems that made national news for various researchers in the past decade related to charges of research malfeasance and sigh heavily about how few of them went through any kind of internal review early enough that adjustments could have been made to be within legal and local rules. 

To emphasize the lesson for the casual reader: The time to ask questions, even just related to local rules, is during the planning stage of the research, not once the research is done and the publishers are asking questions.

stupefied

Just to give some more context: The publisher did not question the data scraping; in fact, I don't think the editor or reviewers even cared or thought twice about it.

The reason for my concern is that I recently went back to the website and looked at the fine print. I was not a registered member of the site. You don't have to be a registered member to view the "data," as it is open to the public. The site is public and UNmoderated.

Also, the reason I never thought it was an issue was that in the course of the literature review (well before the data collection even began), there were peer-reviewed journal articles in which the authors also clearly mined the data from the same website.  In some cases, they also constructed a dataset, whether similar to or different from the dataset I constructed myself.


Puget

Quote from: stupefied on June 22, 2019, 04:59:09 AM
Just to give some more context: The publisher did not question the data scraping; in fact, I don't think the editor or reviewers even cared or thought twice about it.

The reason for my concern is that I recently went back to the website and looked at the fine print. I was not a registered member of the site. You don't have to be a registered member to view the "data," as it is open to the public. The site is public and UNmoderated.

Also, the reason I never thought it was an issue was that in the course of the literature review (well before the data collection even began), there were peer-reviewed journal articles in which the authors also clearly mined the data from the same website.  In some cases, they also constructed a dataset, whether similar to or different from the dataset I constructed myself.

This sounds like a question for a lawyer with copyright/corporate data use expertise. Might you be able to speak with someone in your university's intellectual property office (presuming you have one)?

pigou

Just go ahead with the publication... the chances of getting sued by an unmoderated website are as low as imaginable -- and it's just not worth forgoing a book publication to eliminate that risk.

Very few places are going to spend tens or hundreds of thousands on legal fees when the best outcome for them is that other people will be less likely to do it. A website without any contact info is even less likely to do so. What kind of harm are they going to try and show in court? And if they were motivated to do so, IRB approval wouldn't do you any good. Neither would the university lawyers: they'd obviously be motivated to protect the university and not you.

stupefied

So I got a response from the website, after seeking permission to mine their data.  I honestly did not think I would get a response, seeing how the website is owned by a corporate conglomerate. Here is their stock response:

Thank you for contacting ______ and for your interest in our site.

As you know, it is against our terms of use to leverage our data externally. I have consulted with our management and unfortunately we are unable to make any exceptions on this. Kindly note, this policy is in place for the security and privacy of all our users.

I hope this information helps and thank you for your understanding.

All the best,

--
_____ Support



Curiously, there was no actual name/representative in the response, which I thought was quite odd.

I contacted the IRB at my institution, and it definitely is not a human subject issue.  The IRB administrator will contact the general counsel at the institution.  The publisher also will consult with the publishing house's legal counsel.  Hopefully, I will hear back by the end of the week.


pigou

I'm a little bit surprised by your efforts to track this down... there's really only a downside for you and you've just triggered the first one: now you're explicitly violating their terms and can't feign ignorance anymore. Deliberate ignorance is a big part of law.

Your university's general counsel doesn't care about your book contract. They care about eliminating every risk of the university getting sued. So I'm fairly sure they'll tell you not to do it.

Then what? You're now explicitly ignoring your own university counsel AND the website to publish the book anyway? You're not going to publish your book?

There's really no upside to this. Only downsides.

stupefied

Feigning ignorance--or even real ignorance about a website's terms--does not protect an ignorant individual.   And my ignorance occurred AFTER the data collection.

Would a website sue an individual for violating the terms of contract? I would think they MIGHT only if they can prove that damages occurred--for instance, if my book causes people to sue the site or causes lost profits. Neither of these scenarios is likely to occur, particularly when the data cannot be copyrighted.

I'm trying to see if there is some way around this.

pigou

Quote from: stupefied on June 26, 2019, 10:38:05 AM
Feigning ignorance--or even real ignorance about a website's terms--does not protect an ignorant individual.   And my ignorance occurred AFTER the data collection.

Would a website sue an individual for violating the terms of contract? I would think they MIGHT only if they can prove that damages occurred--for instance, if my book causes people to sue the site or causes lost profits. Neither of these scenarios is likely to occur, particularly when the data cannot be copyrighted.

I'm trying to see if there is some way around this.

I'm not a lawyer and I'm not your lawyer. But there's a big difference between "ignorance of the law" (e.g. "I didn't know stealing things was illegal!") and "ignorance of fact" (e.g. "I believed the chair on the street was abandoned and not someone moving."). You'd have to show that you genuinely believed the data were freely available, which you might be able to do by pointing to other work relying on the same data. You can no longer do this: you now know it to be against the terms.

If your university counsel tells you that you can't use the data -- because they worry about the cost to the university and don't care about your book -- then what? Are you going to contact your publisher and pull the book contract? Are you going to ask their lawyers for an opinion -- and see if they care more about your book than protecting the publisher? And suppose they said it's fine: would you ignore the university counsel and risk getting in trouble with your employer, even if nobody from that website ever reached out?

The absolutely best case for you is if the university counsel tells you it's fine to go ahead and you're exactly where you were before you started this inquiry (minus an ignorance of fact defense if you ever do get sued). The really bad outcome is that they tell you not to use the data and you don't publish your book based on the data. The worst possible outcome is that your university counsel says you can't do it, you publish the book anyway, and they fire you not because the website complained, but because you directly went against their guidance.

stupefied

What is frustrating is that I tried to do the right thing after I learned of my innocent error; yet, if I had not reached out for permission, maybe things would have been easier? 

But I wonder how many people have published works (well, certainly journal articles have been published but not so much books) using that website's data? 

The other thing is that the website forbids commercial gain from use of the site's data. I don't see that as a problem; I can simply forgo the advance and any royalties from the book. My intention is to use the book for academic purposes, not to make money (especially since academic books tend not to sell well anyway).