Datashare – Interesting tool developed by ICIJ

Have you ever encountered the following?

You have read a bunch of documents. Think a lot of documents. But after reading, you are trying to recall a specific detail. A detail you have forgotten. The name of a company or product, for instance. And really, you are not interested in reading all the documents again. So you start skimming the texts for the missing detail.

Good news – now you won’t have to skim them yourself. Let Datashare do it for you.

Will Fitzgibbon of ICIJ has written a pretty good guide to Datashare, which is a good place to start.

ICIJ is the International Consortium of Investigative Journalists, the organization behind the Panama Papers, LuxLeaks, Offshore Leaks, the Implant Files and many other large-scale international collaborative (data) journalism projects.

We are currently attending the European Investigative Journalism Conference (EIJC19) in Mechelen and will be writing about interesting software for research purposes.

Link: Datashare on ICIJ
Link: Dataharvest/EIJC

Collecting data from Facebook pages

If you are interested in collecting and analyzing data from Facebook pages, here is a short how-to guide. You can only collect data from public pages, which means no automated collection of groups or personal profiles. You might experience issues with posts that are not collected, so always check that the data you collect appears to be complete by going to the Facebook page and comparing the numbers of likes/comments/shares and posts. If you experience any issues, try collecting smaller amounts of data over a shorter period of time. You can always merge multiple data sets later on.

This guide requires Excel and a Facebook account.

  1. Find Facebook ID
    • Go to the Facebook page that you want to collect data from via a web browser
    • Copy the URL address
    • Paste the address into the available space at https://lookup-id.com/
    • Copy the number (Facebook ID) that you receive through this process.
  2. Collect data from Facebook pages via Netvizz
    • Go to the Netvizz app via the URL: https://apps.facebook.com/107036545989762/
    • Install the app and accept its permissions. It is a university app developed by the University of Amsterdam, and your data is not stored on their servers.
    • Press the link Page posts and insert the Facebook ID
    • Choose the dates and other collection details. Leave "post statistics only" selected unless you are specifically interested in the content of comments
    • Press posts by page only or posts by page and users
    • Scroll down to the bottom below both graphs. This is tricky on a Mac, since you might not see the option for scrolling. Below the two graphs you will find a download link saying Download link as zip file. Press that link to download the file.
  3. Open in Excel
    • Unpack the zip file by double-clicking it.
    • Choose the file that is NOT called statsperday.
    • Open the file by right-clicking it and choosing Open with and then Other.
    • Select all programs rather than recommended programs, then find your Excel program.
    • Click the Excel program and view the data in Excel.

You should now have an overview of the collected data in Excel, sorted into columns and rows. If the data does not appear to be sorted into columns, select column A, press Data and then Text to Columns. Choose Delimited, then select Tab and press Finish.
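If you would rather skip Excel, the file in the zip archive is tab-separated text, so it can also be loaded with Python's pandas library. Here is a minimal sketch; the filename page_posts.tab is just a stand-in for whatever your downloaded file is called:

    # Minimal sketch: load a tab-separated Netvizz export with pandas.
    # "page_posts.tab" is a placeholder; use the actual file from your zip archive.
    import pandas as pd

    posts = pd.read_csv("page_posts.tab", sep="\t", encoding="utf-8")

    # Sanity checks: compare these numbers against the Facebook page itself
    print(len(posts))                # number of posts collected
    print(posts.columns.tolist())    # available fields in the export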

Watch two how-to guides from the developer himself. The interface has changed a little since they were recorded, but overall it is very similar.

Link to YouTube video


Scraping 101 and basic programming concepts

Credit: screenshot from http://sumsum.se/posts/scraping101-part2/ by Mikko Helsig

Ever wondered how scraping works? If you are pretty much blank when it comes to programming, this guide is probably not for you. However, if you have the basic concepts in place, the author, Mikko Helsig, shows you in a few steps how to scrape a site in Python (and also how to install Python in a Windows environment).

The prerequisite is a basic understanding of programming. In return you get a sense of how Python works, how scraping works, and an introduction to the really cool libraries requests, requests_cache, BeautifulSoup and Gender (the latter is a library used to guess the gender of names).

Link: Scraping 101
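
To give you a taste before you dive in, here is a minimal sketch of our own using requests, requests_cache and BeautifulSoup; the URL is a placeholder, and the cache simply spares the site repeated hits while you experiment:

    # Minimal sketch: fetch a page and list its links with requests + BeautifulSoup.
    # The URL is a placeholder; point it at the site you actually want to scrape.
    import requests
    import requests_cache
    from bs4 import BeautifulSoup

    requests_cache.install_cache("scraping101")  # cache responses while experimenting

    response = requests.get("https://example.com/")
    response.raise_for_status()  # fail loudly if the request did not succeed

    soup = BeautifulSoup(response.text, "html.parser")

    # Print the text and target of every link on the page
    for link in soup.find_all("a"):
        print(link.get_text(strip=True), link.get("href"))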

If you are totally new to programming, we encourage you to start by learning a little Python. There are numerous places to do this.

We also deeply encourage you to start with programming in an environment such as Anaconda. A short description of the Anaconda Navigator can be found here.
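
For reference, creating and activating a fresh environment from the command line looks roughly like this (the environment name is just an example):

    conda create -n my-env python=3.7
    conda activate my-env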

Get data from Instagram with Instaloader

We have added Instaloader to our External resources.

Instaloader is a tool to download pictures (or videos) along with their captions and other metadata from Instagram. You can download either profiles or hashtags, and it’s possible to set up filters (for instance date filters, see below) to narrow your search.

To use Instaloader, you should do the following.

  1. Download and set up a new Anaconda environment with a Python version higher than 3.5.
  2. Install Jupyter Notebook in the environment and open a new terminal.
  3. Do a pip (not pip3, as that does not work with Anaconda) install of Instaloader and its dependencies:
    pip install instaloader
  4. Create a new folder in your root environment (typically the Documents folder) called, for instance, Instaloader.
  5. To avoid everything being saved in your base folder 🙂, change into that folder in the terminal:
    cd instaloader
  6. Run various commands in your terminal. Please note that the interface is rudimentary, but filters can be applied with the use of boolean expressions, for instance:
    instaloader "#HASHTAG" --post-filter="date_utc >= datetime(2017,1,1) and date_utc <= datetime(2018,1,1)" --login=USERNAME

We would love to implement this as a hosted service, but it is not likely we will do so just now. Therefore, please experiment with it yourself. You can also ask for our advice, and we will do our best to help. If you plan to use this tool on a regular basis or for larger data sets, you should probably be ready to use several user accounts and/or proxies to avoid being banned.
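
If you are comfortable with Python, Instaloader can also be used as a module rather than through the command line. Here is a minimal sketch applying the same date filter as the command above; PROFILE is a placeholder you should replace with a real public profile name:

    # Minimal sketch: use Instaloader as a Python module instead of the CLI.
    # "PROFILE" is a placeholder; replace it with a real public profile name.
    from datetime import datetime
    import instaloader

    L = instaloader.Instaloader()
    profile = instaloader.Profile.from_username(L.context, "PROFILE")

    # Same idea as the --post-filter above: only keep posts from 2017
    since = datetime(2017, 1, 1)
    until = datetime(2018, 1, 1)

    for post in profile.get_posts():  # newest first
        if since <= post.date_utc <= until:
            L.download_post(post, target=profile.username)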

Pictures and presentations from the official launch

On 28/11 2018 we had our official launch with a seminar and get-together at our place.

We had a smashing time discussing the possibilities with you guys, and look forward to realizing ideas from the day. If you would like to relive the day (and who wouldn’t), you will find links to presentations and programme below.

Yourtwapperkeeper will soon be closed

Our hosted service for collecting Twitter data, called Yourtwapperkeeper, has been retired and is having a lovely time at the old software home.

Currently the data can still be downloaded, but the server will be completely closed in January 2019. If you are using data from that server, you should download it by then.

If you need to collect Twitter data, please use our TCAT servers instead.

All data and ongoing collections from Yourtwapperkeeper have already been migrated to TCAT. Read more about the TCAT services under our Hosted resources.

Opening hours at Digital Media Lab in 2018

In many cases, we find that students and researchers prefer to contact us by e-mail, but in the autumn semester of 2018 we are experimenting with limited opening hours.

During these opening hours, Sander will be in the office. You are welcome to drop by unannounced, but it is often a good idea to write in advance, so Sander knows you are coming and can prepare to solve your problem.

The opening hours are currently every other Wednesday from 13:00 to 15:00. That means the following dates in 2018:

  • 24/10, 13:00-15:00
  • 07/11, 13:00-15:00
  • 21/11, 13:00-15:00
  • 05/12, 13:00-15:00
  • 19/12, 13:00-15:00

The first couple of tools are online

We are alive and kicking!

We are starting up our toolbase with some rudimentary tools and scripts. They are divided into Hosted resources and External resources.

Hosted resources are things we host ourselves on local servers, or scripts we have made ourselves and can assist you in running on your own computer or in the lab by appointment.

External resources are things we find interesting but have no ownership over: tools you can use, but that we don’t officially support. You are still free to ask us for advice on these tools, though.

We also have a list of links to places where you can find data. You are welcome to browse that list, but it is far, far, far from complete. So if you know a good link, please send us an e-mail.