Decentralized Technologies at the Internet Archive - Arkadiy Kukarkin - Web3 Summit 2024 Berlin

Published: Sep 04, 2024 Duration: 00:17:03 Category: Science & Technology

Hello, my name is Arkadiy Kukarkin, and I'm here to talk to you about the joys and pains of working with decentralized technologies at the Internet Archive.

You probably know who we are, but just in case: we're probably best known for the Wayback Machine. We've been archiving the visible internet since 1996, so whitehouse.gov, here we have it from 1998. We actually have it from before, but that version didn't look very interesting. We are also a real library, with an actual physical space that looks like this, and we have books, we have audio recordings, we have film, we have ROMs of old games and software, just about every medium type you can imagine; we really try to be a universal library. And we're not without our quirks: for example, as a three-year job perk you get a half-size terracotta statue of yourself, made to be put into the great hall of the headquarters. So that might give you a little bit of context for what the organization is like.

So why are we here at Web3? It might be a little bit hard to see, but we have a particular motto, "universal access to all knowledge," which, taken literally, can present certain problems, but we broadly stand by it. And we've been involved in the space for some time now. Since about 2016 we've been running a series called DWeb Summit, named thus in part to allow this fine event to have its namespace, but also to indicate the inclusion of a variety of peer-to-peer technologies, as well as power-balance technologies and thought frameworks, in addition to what we think of as Web3, which is a little bit more of a crypto-focused narrative.

So here is our team in 2016, at the first Builder's Day at the first DWeb Summit. You might see some familiar faces over there: we've got Juan, we've got Feross, but we also have Tim Berners-Lee and Vint Cerf. This is really a joining of generations of pioneers of these technologies. And the series has continued; this is
from last year. We now do camp formats, and as you see it's grown quite a bit. I welcome all of you to join us at the next edition, most likely next year.

Okay, so what is our motivation for working with these technologies in the first place? I would say it's twofold. One side is an idealistic drive to provide a better internet, a better web, for all users. The primary concepts in this idealistic strain involve individual data sovereignty and privacy; these are, I think, cornerstone rights for a healthy digital ecosystem of any kind, and should be table stakes. They also involve the balance of power: we're very concerned with the classic big-tech situation that we end up with over and over again in a variety of subfields. And we're also very concerned with interoperability as being essential to those other two goals.

However, we also have an applied angle. We are a working library, we serve tens of millions of users, and we do want these technologies to work for us in some capacity. This gets a little bit contentious, and as a caveat, this talk is my personal interpretation and not an official view of the Archive. But I think the two things that we are concerned with, and that this set of technologies could really help with, are storage and transmission of our archival materials. This is high-volume stuff, hundreds of petabytes, with high cardinality, meaning we have tens or hundreds of billions of individual records, again depending on how you count. And we have a very long-tail access pattern: we have things that are never accessed, we have things that are in extremely hot storage, and there is not really a clear separation between the two, or a good prediction strategy. Lastly, content authentication and access control is something that's very important to us. And whenever you see a double asterisk like this, that means it could be its own separate talk, about double the length of this
current one.

So what have we done so far? The first project that I worked on at the Archive was IPFS. I was actually working at Protocol Labs at the time, handling a number of partner integration and deployment initiatives, and what I saw with this one in particular was that it was just stuck, and it needed someone really committed and really well-informed in those domains to make it work, or at least to approximate that. After getting through the surface problems, we ended up with some deeper ones. The biggest one was that the cardinality of our primary keys was about the same size as the entire IPFS DHT at the time, give or take an order of magnitude depending on how you count. Furthermore, announcing the provider records to actually make the content available on the IPFS network, even for 1% of our content, would take more than the lifetime of those records, which is 24 hours. And the blockstore that actually stores the data on the back end just could not keep up with the scale we were trying to work at.

Notably, as of today a lot of these problems have been solved, and they've been solved in kind of a funny way, which has basically been moving to giant nodes and abandoning certain concepts that had been integral to what people thought of as IPFS some years prior. For example, the DHT has been substantially supplanted, or at least supplemented, by IPNI, which is basically a giant machine that knows where all the blocks are; theoretically this is federated, in practice it's a little bit unclear. The larger data providers have also shifted to a pattern that has been called in some circles "elastic IPFS," which again involves removing a lot of the guts of the system: gutting the blockstores, storing raw CARs in S3 buckets, and using range requests to read the block data back. Funny enough, that's actually very similar to how the Wayback
Machine works: we have our crawls stored in WARC files, and we request the relevant segments through Range headers. So this might actually be something worth revisiting very soon.

The next step, which was logical in some sense, was that if we can't necessarily serve the data today, we could at least try to store it: Filecoin. A key thing that I would like to remind you of is that Filecoin and IPFS are largely unrelated concepts; they just happen to be joined at the hip in an organizational sense. If you've seen me talk before, this is probably something you've heard about, and this is probably another double asterisk here.

So how have we done with Filecoin? Well, it's a little bit hard to say, because a lot of the onboarding happens through aggregator on-ramps, which we stubbornly refuse to use. We insist on using the protocol pretty much directly, doing our own data preparation and working directly with service providers. Nonetheless, we are currently, I believe, the second most expansive institutional user of the system, with some tens of thousands of deals and petabytes stored on chain. We also run a small in-house mining operation, which is very experimental, as you can see. This is the actual machine, with a drive array on top; it's in a little closet at the church that is our headquarters, housed behind a rack of servers through a hidden door, kind of a Narnia situation. And this actually may make sense for us, because we are very stingy with our storage generally: we keep two copies of everything and don't use erasure coding or any fancy techniques like that. So perhaps a third copy that sort of pays for itself, and has something resembling what we call fixity in the archival sciences, could be a good play. We're ramping this up, but it's still very small, only about 100 TiB of power.

We've also worked with a lot of other storage technologies. WebTorrent: very portable, very small; this is actually live for any
given Archive item today. A small tip, an unofficial feature: if you upload a torrent file into an Internet Archive item, it will actually pull down the contents of that torrent into the item, and it will also act as a web seed. So you can play around with that if you want. We also use tools like Storj, which actually mostly just works; it hasn't been super exciting, but I find it to be a pretty good system. We've used Dat and Hypercore and that whole family of tools, which have a lovely but very small community, low usage, and no sustainability model for their development, so unfortunately that is largely dead. And we've also played with things like Arweave, which is just incomprehensibly tiny compared to the amount of data that we would actually need to store. By the way, if anyone here is from Walrus, come talk to me; I would love to have a chat.

So the problems that we've encountered with many of these technologies along the road tend to follow the same patterns. The biggest one is that they're built with too many assumptions. They're almost always developed by teams working with a pretty much unlimited budget of EC2 credits, or other infrastructure on demand that they can deploy at whatever scale they see fit, or else they have bare metal that is spec'd exactly to what the tools demand, which is actually not how most real-world organizations are going to work. There are also a lot of assumptions about egress: people are very confused when they look at our near-terabit of interconnect and then look at the transfer speeds, and the answer is very simple: people are using all of that interconnect to get to the content. And lastly, storage mounts, kind of a small technicality, but everyone assumes that you have a nice file system that you can read from quite fast, and that has all of the nice flags and all of the things that you expect from a real file system. Well,
actually, not every situation will be like that. This is why we like some of the tooling in the Filecoin world that is based on rclone, which provides a level of abstraction, compartmentalizes some of the finer access details, and allows the community to contribute.

The second problem that I've seen time and again is over-specified implementations that were made in a kind of clean-room environment, without talking to the users or rolling out minimal implementations. A great example of this is the Filecoin storage and retrieval markets, payment channels, GraphSync; all of that stuff took a huge amount of work, and it was mostly ripped out, decommissioned, deprecated, or otherwise unused. So please: ship your mainnet with simple primitives and let the users find their path with it.

And okay, this last bit is cut off, but the promise of a lot of these systems is that they will undercut storage costs, and even with the subsidies that we see for many of these systems, our storage cost is 500 bucks per terabyte per 100 years. Usually when you show that to people working in the space, they walk away wide-eyed, unable to comprehend how we do it. Which is actually by having a "class zero" data center: no air conditioning, no backups, using spinning disks that are highly cost-optimized, and having highly dedicated staff.

Some of the other work that we've undertaken, or are thinking about, that doesn't necessarily touch on the main domain of storage: one is Tor, which you don't necessarily think about in the DWeb context, but it is very much a network in that spirit, certainly embodying a lot of the ideas of privacy that we are also concerned with. We have a Tor onion service that has had a lot of optimizations done for separating and identifying traffic, such that you don't get throttled just by sharing an exit node with an abusive user, and so on. Just go to archive.org in Tor Browser or Brave, smash the purple pill in the
title bar, and it'll take you there.

TLSNotary: this is another big double star that connects back to some of the authentication context. We have a user-sourced crawling arm, we call it Archive Team, where we will archive a site such as Flickr if there is, you know, a 5 p.m.-on-Friday "downtime on Monday" announcement. A service like TLSNotary can theoretically allow us to authenticate those crawls and bestow on them a similar status of authenticity, or source of truth, as our own crawls. But we do need to be convinced, because our team is a little bit old school, so talk to us, come to us, convince us that the cryptographic proofs provided by such a system are actually important; I believe that, but not everyone does.

And aside from this, there's a lot of other stuff. As a trivial example, you can store your genesis block with us. You don't have to, and we don't want to be just generic free storage for everything, but culturally significant things such as genesis blocks, absolutely: come to us, bring them, stick them in an item, and we will sign off on them. Another thing that I should have put a double star on is gated access and compute over data for ML. As mentioned in the previous talk, we are obviously a large source of training data for all sorts of models, and while we operate as a public service, we are not actually a fully world-readable resource, and we also exercise some amount of rate limiting and so on for abusive or overly enthusiastic consumers. So thinking about how to package that data, how to gate access to it, and how to bundle it with compute over data for training is a really high priority for me personally, so if you're interested in that, also please come and talk to me. Indexing and preservation of Web3 content is another one that has been on the back burner. Of course, there are a lot of services, such as NFT.Storage, web3.storage, etc.,
that will provide subsidized "permanent" storage for you, but can you rely on that? Can you rely on that type of organization to actually provide long-term durability? That's an open question. I think our mandate puts us in a distinctly different and distinctly more durable position. And lastly, we're very open to experimentation, and we would like to experiment more with different modalities. Anything that you see sticking to the principles I outlined at the beginning is fair conversation: talk to me, talk to our founder, Brewster Kahle, and there's a pretty good chance we'll host a node, or at least give you some brutally honest feedback.
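[Editor's note] The DHT provider-record bottleneck described in the talk can be made concrete with a rough back-of-the-envelope calculation. The record count comes from the talk ("tens or hundreds of billions"); the announcement rate below is a purely hypothetical, optimistic assumption for illustration, not a measured IPFS figure.

```python
# Sketch of the provider-record math: even announcing 1% of the corpus
# outlives the 24-hour record lifetime, so records expire before the
# announce cycle completes.

total_records = 100e9              # ~100 billion records (from the talk)
one_percent = total_records * 0.01

announce_rate = 10_000             # announcements/second: hypothetical, generous
record_ttl_hours = 24              # DHT provider records expire after 24 hours

hours_needed = one_percent / announce_rate / 3600
print(f"{hours_needed:.1f} hours to announce 1% of the content")

# At this rate, re-announcing even 1% takes longer than the records live,
# so the announced subset can never stay fully resolvable.
print("exceeds record TTL:", hours_needed > record_ttl_hours)
```

Pushing the assumed rate higher helps only linearly, while the full corpus is two orders of magnitude larger than the 1% slice used here.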
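[Editor's note] The range-read pattern the talk describes, for both "elastic IPFS" (raw CARs in S3 buckets) and the Wayback Machine (crawls in WARC files), can be sketched as follows. This is a simplified in-memory illustration, not the Archive's actual code: the record layout, the index, and all names here are hypothetical.

```python
# Records are packed into one large archive blob (a WARC or CAR file),
# and an index maps each record key to an (offset, length) pair. A reader
# fetches only the bytes it needs via an HTTP Range header instead of
# downloading the whole file.

def range_header(offset: int, length: int) -> str:
    """Build the HTTP Range header value for a record at offset/length."""
    # Byte ranges are inclusive on both ends: bytes=first-last
    return f"bytes={offset}-{offset + length - 1}"

def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Stand-in for the server side: return just the requested byte range."""
    return blob[offset:offset + length]

# Hypothetical packed archive: three records concatenated into one blob.
records = [b"record-one", b"record-two!", b"record-three"]
blob = b"".join(records)

# Build the index as records are appended: key -> (offset, length).
index, offset = {}, 0
for i, rec in enumerate(records):
    index[f"rec-{i}"] = (offset, len(rec))
    offset += len(rec)

# "Fetch" a single record without touching the rest of the blob.
off, ln = index["rec-1"]
print(range_header(off, ln))      # → bytes=10-20
print(read_range(blob, off, ln))  # → b'record-two!'
```

In production the blob would live in object storage and `read_range` would be an HTTP GET with the `Range` header; the mechanics of offset-plus-length indexing are the same.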
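[Editor's note] The quoted storage cost, $500 per terabyte per 100 years, is easier to compare against commercial offerings when broken down into the per-month units cloud providers quote. The arithmetic below uses only the figure from the talk; the decimal TB-to-GB conversion is a simplifying assumption.

```python
# Break down the talk's figure of $500 per terabyte per 100 years.
cost_per_tb_century = 500.0    # USD, figure quoted in the talk
years = 100

per_tb_year = cost_per_tb_century / years   # USD per TB per year
per_tb_month = per_tb_year / 12             # USD per TB per month
per_gb_month = per_tb_month / 1000          # USD per GB per month (decimal TB)

print(f"${per_tb_year:.2f}/TB/year")    # $5.00/TB/year
print(f"${per_tb_month:.3f}/TB/month")  # $0.417/TB/month
print(f"${per_gb_month:.6f}/GB/month")  # $0.000417/GB/month
```

At well under a tenth of a cent per GB-month, it is not surprising that people accustomed to cloud object-storage pricing walk away wide-eyed.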
