Tuesday, July 5, 2016

Generating Content with Service Workers

In this talk, Ben Foxall covers how service workers can be used to gather, process and generate content in the browser. He shows a practical use case of using this to drive a web-based visualisation of geolocation data.
This talk was part of the JS Monthly London April event.

[00:00:07] I’m Ben, I’m BenjaminBenBen on Twitter. I’m going to talk about serving data from browsers. A bit of a spoiler: this is going to end up saying that service workers are kind of cool, which is what Phil’s just told us for half an hour. Thanks, Phil. Thanks for that. That was really handy. I’m not actually going to talk as broadly or as well as Phil just did, but I’m going to talk about one specific use case that you might use service workers for.
[00:00:42] Actually, I’m not going to talk about just purely service workers; I’m going to talk about things that you can use in service workers but also elsewhere. It’s basically a deep dive into a particular project that I found quite interesting to work on. The thing I was working on is how you take the data that you’re collecting and pushing up into some web API, and how you can get that back down into your hands. The goal we’re aiming for is: as a user, I want to have my data as a CSV document. That’s a tangible end goal to aim towards.
[00:01:20] The two data sets that I’m going to talk about are Last.fm and Runkeeper. I think everyone knows what Last.fm is, and Runkeeper is this run-tracker thing that you put on your phone and it gives you GPS traces. I’m going to switch between those two. I’ve built these two projects that export that data in the way that I’m going to go through now. One of them does Last.fm and uses standard web technology, and the Runkeeper-to-CSV one uses service workers.
[00:01:56] Just a bit of an introduction to the problem. We’re going to start with Runkeeper and just think about how we get to our data. Our goal is to get a CSV document of our activities or our coordinates as we’re running through. Let’s go through how you might do that.
[00:02:15] First of all, we might go to Runkeeper.com and try and download it. A lot of services have some kind of page where you can put in some dates and press the export button, and then you get a zip file with all this data in it, which is cool. Right? I’ve got these four activities that were in that range and a couple of CSV documents. This is cool. The good things about it are: we’ve done it. We’ve got our data downloaded, and that’s what we were trying to do. Another thing is, once we’ve got that data downloaded, it works offline, essentially. We can plug it into Excel if we want; it’s not an online resource we need a continuous connection to. The bad thing (this signifies bad) is we don’t have any control over that format. Right? We got that data down as GPX files. I’m not really sure what that is, but we want it as CSV files, or maybe we want it as GeoJSON files, and we don’t have control over how that’s exported. That’s all down to Runkeeper and how they’ve chosen to export it. That functionality might change or disappear as well; for instance, in the newest Last.fm site launch, you’ve lost the ability to export your data, which is a bit sucky, but you can use an online, browser-based tool to do it.
[00:03:37] Cool. That’s what we’ve got with downloading it straight from the site. Attempt number two: we might want to write a script, okay? We write a bash script that goes to Runkeeper, we plug in a token, and we can download our data. This uses a really cool tool called jq, which allows you to pull data out of JSON documents, and doing this you can pull out the URLs for all your activities and then plonk them into a file. This is up on a GitHub gist, so that other people can share it, download it, and run it themselves, which is cool. This is great! We’ve got control over the format; we’ve got the raw data, so we can cut it whichever way we want using our own tools. It’s shareable: you can go on GitHub and put up your own ways of tweaking it, which is great. It’s also offline, in a way. As soon as you’ve started downloading that data it’s there on your hard drive and you can play around with it. Even if you’ve only downloaded half of your activities, that’s fine. The bad thing is that this is super inaccessible, in a way. If you think of all the people that could benefit from having their data in a CSV file, and then you think of the people who can run a bash script and know how to install jq and know what a curl command is, that’s a tiny little proportion. So scripts, although they’re cool and useful, are inaccessible, which brings us to attempt three.
[00:05:08] Attempt three: we make a web service to do this. Luckily, both those services have APIs, so we can make our own thing: a web service that goes to those APIs, pulls the data down, and gathers it together. That might look something like this, where you sign into Runkeeper, it downloads all your activities in the background, and then after a little while you can come back and download your data in the same way as you do on the Runkeeper site, except we’ve implemented it ourselves using that API. This is super cool. It’s accessible. We’ve got those format options from before. It, again, works offline because you’ve downloaded it once, but if you’re trying to download a different cut of it, then you’ve got to be back online. The bad thing about this is that it’s actually hard to make. It’s not a tiny little hack project. You’ve got to think about how to deal with hundreds of concurrent requests, how to queue stuff up, how to notify users when stuff is done – it’s difficult. Also, there’s the potential of having to store sensitive data on your back end. For instance, with running data, you can find out where people live and where they’re running to, which is sensitive stuff. You’re having to deal with that issue by storing it somewhere that they can download it from. Also, as an end user you’ve got to wait for that process. Before, when we had that bash script, as soon as stuff was streaming down you could start tweaking it and looking into it; whereas in this scenario, you’ve got to come back in a few minutes, and then you can download it and see what it’s like.
[00:06:48] Which brings us to attempt four, which is to go client-side, right? This is obviously where I was going, because we’re at JS Monthly London. In the web service we’ve just seen, your browser would hit the web service, which would then go to Runkeeper and download all the stuff. Another option is for our browser to authenticate with our service, get the details it needs to go to Runkeeper itself, and pull the data down into the browser itself. This is what that might look like, and what it does look like. We’ve got this page that’s downloaded 175 activities here, and you can download these different cuts of it down here. That’s what you can do by bringing it client-side. We’ll get into more detail of how to do this. The good stuff about this is that the data is stored locally, so you don’t have any of those issues with storing it on your server. It’s available instantly. As activities are downloading, you’re able to download cuts of that data, so once you’ve downloaded two or three activities you can download a summary of those three activities; you don’t have to wait for it all to be compiled and aggregated on a server. It doesn’t work everywhere and it might break sometimes, but that’s cool.
[00:08:11] Right, okay. We’re not going to cover that in too much depth. The things we are going to cover are what we need to do in the browser to make this happen. The first step is we’re going to request the data from our source, either Runkeeper or Last.fm; we’re going to take that data and process it – convert the JSON into CSV documents, for example – and then we’re going to serve it to a user. We’re going to look at some of the options we have for that. For this, we’re going to use Last.fm as a nice, simple example. I take it everyone has heard of Last.fm? It’s like a music-tracking thing, so everything you listen to gets pumped up to Last.fm. The first part: requesting. What we’re using here is the Fetch API, and we’re just wrapping around that to make an easy way of hitting the Last.fm API. We’ve got a hardcoded key in here, and this is just a simpler way to wrap our requests. We’re using ES2015 here, with arrow functions. I really like how this makes it really easy to compose promises – stuff like this turns out weirdly clean. Yes, that’s cool. That’s our abstraction over the API.
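The wrapper might look something like this – a minimal sketch assuming the public Last.fm REST endpoint; the `lastfm` name, the placeholder API key, and the exact shape are mine, not the code from the slides:

```js
// A minimal sketch of a fetch wrapper around the Last.fm API.
// API_KEY is a placeholder value.
const API_KEY = 'your-api-key';

const lastfm = (method, params = {}) => {
  const query = new URLSearchParams(Object.assign({
    method,
    api_key: API_KEY,
    format: 'json'
  }, params));
  // fetch returns a promise, so calls compose cleanly with .then()
  return fetch('https://ws.audioscrobbler.com/2.0/?' + query)
    .then(res => res.json());
};

// e.g. lastfm('user.getrecenttracks', { user: 'someuser' })
//        .then(data => console.log(data));
```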
[00:09:42] Our next task is to get all the requests that we need to fulfil a user’s data set. Here we’re pulling out the first page of their data, finding out the number of pages, and then generating a request for each page. That will resolve to something like this: all the pages of data that relate to my username, which I can then download. Then we just need to make them. To get the data for a user, we just call that function and then submit all those requests to Last.fm through the API wrapper that we made before. Pretty straightforward. You have to do something to throttle the requests, but that’s a bit boring, so I’m not going into it; you can work it out.
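Roughly, expanding a user into one request per page could look like this (a sketch; `lastfm` is the wrapper from above, and `totalPages` is the page count that Last.fm’s `user.getrecenttracks` reports in its `@attr` metadata):

```js
// Fetch page 1 to discover the total page count, then build the
// full list of page requests for that user.
const pagesFor = user =>
  lastfm('user.getrecenttracks', { user, page: 1 })
    .then(data => {
      const total = Number(data.recenttracks['@attr'].totalPages);
      return Array.from({ length: total }, (_, i) => ({ user, page: i + 1 }));
    });

// Make all the requests (unthrottled here – a real version would
// limit how many are in flight at once).
const getData = user =>
  pagesFor(user).then(pages =>
    Promise.all(pages.map(p => lastfm('user.getrecenttracks', p))));
```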
[00:10:38] Okay, so that’s that. We’ve got all our requests, and we’ve got our data in these JSON objects. What we’re going to do is process them: turn them from JSON into CSV. There are a lot of different libraries that do this CSV generation, and it’s a difficult thing to do completely right. You can use something like those, or you can just hack one together, which is what I’ve done here. This, in a hacky way, just generates CSV rows. It cuts all the speech marks out of everything. You can see at the bottom you’re basically passing in some information – which keys you want to pull out – and then it will give you a CSV row, which is handy. Now we’ve got this process function, so we can pull out that data. We can map all these requests through it, so basically instead of getting the raw JSON back, we’re getting this CSV string. That’s cool. We’ve got that data in our browser.
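Something along these lines – a hacked-together sketch that, as described, just strips the quote marks out rather than escaping them properly (`csvRow` is my name for it):

```js
// Given a list of keys and an object, produce one quoted CSV row,
// crudely removing any double quotes from the values.
const csvRow = (keys, obj) =>
  keys
    .map(k => String(obj[k] === undefined ? '' : obj[k]).replace(/"/g, ''))
    .map(v => '"' + v + '"')
    .join(',') + '\n';

// csvRow(['artist', 'track'], { artist: 'Orbital', track: 'Halcyon' })
// -> '"Orbital","Halcyon"\n'
```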
[00:11:54] The next part is to serve it. One way of doing this is data URIs: that’s a URL which contains a media type and then the data itself, so you can serve any content through this. Has everyone heard of data URIs? Cool, a few people. What you can do is put one of these in your browser and it will serve it as if it’s content that it’s viewing. You can actually type this in and you’ll get an HTML document with “Hello World” in it. We can do this for CSV documents. If you link to this, it will automatically spark a download in your browser. You’ll get that two-line CSV document with a, b, c and 1, 2, 3 in it here. We can now put this all together. We’re generating that data, passing it through our Last.fm thing; we’ve got this data coming through, and then we’re converting it all to CSV, and this is entirely it. We’re basically creating an element with that link and then setting the href to that data URI, and if a user clicks on that, it’s cool. We’ve got a bit of a problem with that slide, which is that our CSV variable here is a huge, humongous string, which feels a bit ugly – especially with Last.fm data, so I’ve got 190 different rows in my CSV document, so this becomes huge. Then, passing it into a data URI, you’re creating a new string that’s even longer. The solution to this is to use Blobs. This is a way of storing strings of data in an immutable way, so that they’re not in the JavaScript heap anymore; it’s a lot more efficient for creating files and stuff like that. It’s kind of like a File. The way you create this is you initialise a new Blob with an array of strings or typed arrays or whatever. We can take this, and, as we’re doing on the document here, URL.createObjectURL takes that Blob and gives you a short URL that references it, so as soon as you visit that URL in your browser, you’re basically viewing that generated data. You can do some really nuts things with this. One of them is you can set that Blob to be a large CSV object, and when users are linked to it, it will download. That’s a more efficient way of doing it, and nice.
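For instance, wiring a download link up with a data URI might look like this (a sketch, not the slide code):

```js
// A two-line CSV served through a data URI: clicking the link
// triggers a download in the browser.
const csv = 'a,b,c\n1,2,3\n';

const link = document.createElement('a');
link.href = 'data:text/csv;charset=utf-8,' + encodeURIComponent(csv);
link.download = 'data.csv';  // hint the browser to download rather than navigate
link.textContent = 'Download CSV';
document.body.appendChild(link);
```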
[00:14:40] Once again, we can use this. We can convert all of our string concatenation to be Blob concatenation, and suddenly it’s much nicer. At the bottom here we’ve got this text/csv, which gives a MIME type so the browser knows to download it straight off and what to do with it. Here we change that data URI to createObjectURL, and it all works exactly the same, which is nice. What we’ve got there are these three things: request, process, and serve. We’re requesting by expanding all the requests we need to pull down the data for a user, processing by passing that data through our API wrapper, and serving using data URIs and Blobs. That works, and it’s all up online, and it has actually been weirdly popular because it’s a generative thing. I get a lot of users asking me questions about it. I think it’s because it’s not a final thing – I’m not showing people graphs, I’m showing them, “This is your data. You can do your own things with it.” I’ve seen some people doing some really interesting stuff with that, which is good.
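The Blob version of that link could look like this (again a sketch; the behaviour matches what’s described, with text/csv as the MIME type):

```js
// Concatenate Blob parts instead of strings; the data lives outside
// the JavaScript heap, and createObjectURL gives us a short blob: URL.
const parts = [new Blob(['a,b,c\n']), new Blob(['1,2,3\n'])];
const blob = new Blob(parts, { type: 'text/csv' });

const link = document.createElement('a');
link.href = URL.createObjectURL(blob);  // e.g. blob:https://example.com/…
link.download = 'data.csv';
link.textContent = 'Download CSV';
document.body.appendChild(link);
```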
[00:16:00] This is where we get on to service workers. That all works and it’s totally useful and good, but as soon as you navigate away from that page, your blob URLs are lost, which is a good thing in itself, but it means you don’t have a persistent URL, so you can’t point static scripts at that URL or bring it into a different script somewhere. Also, it doesn’t work offline: you’ve got to have an active connection to Last.fm every time you want to regenerate that data. Which is where service workers come in. Yay! Which are great. Yes, IndexedDB, as someone asked about before – I think it was you – I’m using IndexedDB because it works quite well for this, when you’re pulling down a lot of data and trying to cut it in particular ways. Equally, you could use the cache for it if you wanted. Our process here: we’ve put in a few service-worker parts, which are, first of all, saving the data into our database, then processing it, caching it when we do that processing, and then serving it. You’ll see why that is useful. Now we’re switching to Runkeeper data. First up, we’ve got that request part, which is exactly the same as on the front end. As you pointed out, we’re on HTTPS now; that’s necessary because we’re running this in a service worker. Because Runkeeper is authenticated, we’ve got a Node app on the back end which does some stuff with Runkeeper, I think.
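Getting the service worker onto the page is just a registration call, which needs HTTPS (or localhost during development) – a minimal sketch, with sw.js as a placeholder filename:

```js
// Register the service worker that will own the generated routes.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js')
    .then(reg => console.log('service worker registered at', reg.scope))
    .catch(err => console.error('registration failed', err));
}
```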
[00:17:41] Saving it, which is a new part. This is partly the same as the fetch, apart from we’re shoving the result into this IndexedDB instance. I really like using Dexie, which is a really minimal wrapper around IndexedDB; it kind of promisifies it and makes it quite easy to do migrations between different versions. Here we’ve created a store called “activities” on a database called “runkeeper”, and those activities have got an index of uri. What we’re storing is – essentially we’re chucking JSON objects at it, and in this case it happens that our JSON objects have a uri key on them, so that acts as a primary key. You can add as many different keys as you like to that activities store, and you’re able to query by them. Has anyone used Dexie before? No? Cool. You should check it out. It’s really nice. Then we’re just putting the data into Dexie, and because Dexie is built around promises, it works really well with the fetch promise stuff I was talking about before. When you open up your inspector in Chrome, you can see the data, so you can inspect it, and you can see that uri over here – that’s the one that we’re able to make calls on at the moment. We basically iterate through those uris and pull out all the data later on. Right, okay, which brings us to the processing part, where we’ve got our data and we want to process it and convert it into a CSV document.
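The Dexie setup described might look roughly like this (a sketch: the database, store, and index names are from the talk, but the CDN URL and the fetch path are placeholders):

```js
// Inside the (service) worker: pull Dexie in and declare the schema.
importScripts('https://unpkg.com/dexie@1/dist/dexie.min.js'); // placeholder URL

const db = new Dexie('runkeeper');
db.version(1).stores({
  activities: 'uri'   // uri is the primary key; add more indexes here to query by them
});

// Dexie is promise-based, so it chains straight off fetch:
fetch('/api/activity/123', { credentials: 'include' })  // hypothetical route
  .then(res => res.json())
  .then(activity => db.activities.put(activity));       // keyed by activity.uri
```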
[00:19:23] Again, this has got similar parts. Where you can see var data = new Blob, that’s exactly the same code as before: we’re creating a Blob of that CSV row and concatenating it together, but instead of going through those requests and doing it there, we’re actually iterating through a collection of activities from that database. It’s logically the same, but it’s got this different wrapping around it. This allows us, instead of going through the requests, to do it all locally and offline, and it will work in a service worker. The other thing to notice, as well as the Blob, is that we’re returning a Response with that Blob in it; that takes the Blob and makes it into a response, like something you’d get from a fetch request, which you can shove into your cache.
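Put together, the responder might look something like this (a sketch: `summaryResponse` and the field names are hypothetical; `csvRow` is the helper from earlier, and `db.activities.each` is Dexie’s per-record iterator):

```js
// Walk the activities store, building up Blob parts, then wrap the
// result in a Response that can be cached or returned from fetch.
const summaryResponse = () => {
  const parts = [new Blob(['uri,type,distance\n'])];  // header row (hypothetical fields)
  return db.activities.each(activity => {
    parts.push(new Blob([csvRow(['uri', 'type', 'distance'], activity)]));
  }).then(() => {
    const data = new Blob(parts, { type: 'text/csv' });
    return new Response(data, { headers: { 'Content-Type': 'text/csv' } });
  });
};
```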
[00:20:26] Which brings us to caching, which Phil touched on a lot better than I have with this example. Basically, this is an example that I use on the Runkeeper data. What it does is it tries to use a cached version, and if it can’t do that, it will fire off a call and populate the cache. This way you can put different listeners on different routes, and the result will be cached for the next time the user comes to it.
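The pattern is simple enough to sketch (the cache name and the `cached` helper name are placeholders of mine):

```js
// Serve from the cache when we can; otherwise generate the response,
// put a copy in the cache, and hand the original back.
const cached = (request, generate) =>
  caches.open('runkeeper-v1').then(cache =>
    cache.match(request).then(hit =>
      hit || generate().then(response => {
        cache.put(request, response.clone());  // clone: a Response body can only be read once
        return response;
      })
    )
  );
```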
[00:21:00] Which brings us to serving. Serving is a new one, and it’s all about the service worker, right? What we’ve got is this event listener which we’ve implemented, and we’re catching anything that ends with summary.csv; then we’re calling respondWith with that summary-responder thing which I showed before. What this is doing is, when that URL gets caught, it will send a CSV summary and it will cache it nicely, which is cool. The part where you totally win – which is basically what this is all about – is that in that same way you can add these different routes. Here we’ve got summary.csv and we’ve got distance.csv, and those will be caught by different listeners and hit exactly the same database. Because of the service worker and because of the way we’ve architected this, we’re not having to re-request any data from anywhere: as soon as you’ve got that summary.csv file available, you’ve got all your distances and paths. Another one that’s quite interesting down here is our GeoJSON response, so we can format it in drastically different ways. There, rather than longitude/latitude paths, we’ve got GeoJSON files, so it will format it that way. We’ve got that choice that we wanted at the very beginning of how we own our data. Then down at the bottom I’ve got a binary path response, which is a really hard-core typed-array thing that you can load straight into three.js, so you’re able to have that control over what you build. A happy coincidence of this is the fact that it all works offline. Suddenly, once you’ve downloaded that data once, you can come back to it with a shaky network connection and you’ve got it all there. It’s super-fast. One of the things with IndexedDB is that it can take a little while to warm up, so by creating the response from a service worker, you’ve got it instantly: you’re requesting from the service worker, which has already got that database open, so it’s super-fast, as well as the caching. It’s super-great, which I touched on as well.
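The routing in the fetch handler might look roughly like this (a sketch: `cached` and `summaryResponse` come from the previous snippets, and `distanceResponse`/`geojsonResponse` are hypothetical sibling responders, as are the route names):

```js
// Catch requests for the generated documents; everything else falls
// through to the network as normal.
self.addEventListener('fetch', event => {
  const url = event.request.url;
  if (url.endsWith('/summary.csv')) {
    event.respondWith(cached(event.request, summaryResponse));
  } else if (url.endsWith('/distance.csv')) {
    event.respondWith(cached(event.request, distanceResponse));
  } else if (url.endsWith('/routes.geojson')) {
    event.respondWith(cached(event.request, geojsonResponse));
  }
});
```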
[00:23:37] I don’t know if I’ve got time for a couple of demos of this. Okay, cool. Just to show that it actually is a thing and not just a bunch of slides. This is that page – this is running locally – and what it’s doing is populating that IndexedDB instance. This is actually hitting it not through a service worker but through a standard worker, using the exact same code. Now, down the bottom, we’ve got this distance.csv, and that gives us a raw CSV document – which you can’t see, but it’s real. Okay, cool. You’ve got these different cuts of the data. You can see that first one – that was generating it; the next time we hit it, it will be faster. That’s how you can organise the data and allow it to be downloaded. I’ll just show you quickly – I was supposed to link into this. This is what we’ve done. Right. You can see this is us accessing that data through the service worker: we’re pulling that data out of that summary.csv, and we could do that through a regular fetch request, but what’s nice is that this is using that exact same route to visualise the data. It’s all cached, it’s all nice, and so is this. This is like the amount around the earth I’ve run, and that’s using that same endpoint. Suddenly, I can take all these D3 visualisations that I’ve already got, apply them to my live data set, and make it appropriate to a particular person. Then we’ve got these different types – these different cuts of the data. Using the exact same backing data, we can see kilometres versus time, and we’ve also got a lot richer data as well, so we can see all the finer details by loading another path. This is using two different endpoints as well. As I mentioned, GeoJSON: this is using the service worker to access the data in a GeoJSON format. Suddenly we can do some stuff with it. We can plot it using D3. This is every route that I’ve run. You can’t really see the differences very well; it just clusters them and groups them together. You can see the places that I run, which is kind of cool. This is the main one, which is Oxford. We’re basically using the service worker to view this data in a very different way: in the same way that we drew that standard graph before, we’re able to draw in this rich D3 environment.
[00:26:58] As I mentioned, we’ve got binary data we can shove straight into three.js. This is three.js using WebGL, and I think it’s plotting around 110 thousand different points. That’s Brighton. This is to get more of a feel for what this data is like and to explore it. Service workers make that quite easy. Yes, there’s a lot more there, and you can see the amount of data that I’ve got here; WebGL is the best way of visualising it. Cool. That’s all up on GitHub; you can check it out. It’s also online, so if you want to download your data and see what you can do with it, that would be good. Profit. I don’t know what I mean by that, but it seemed like a good thing to have on the slides. Thanks a lot. Cheers.
