IPFS: Why Store Large Amounts of Data on IPFS?

Why store large amounts of data on IPFS? What are IPFS's most attractive features, and what worries people about it? On the official IPFS forum, discuss.ipfs.io, the team ran a survey on why people do or don't use IPFS to store large volumes of data. We've collected and translated some of the responses from enthusiasts and observers below; feel free to share your own thoughts in the comments.

 

flyingzumwalt (Matt Zumwalt):

We're running a round of interviews with users who have large volumes of data (tens of terabytes to tens of petabytes) to store on IPFS, so that we can understand their needs. For users with heavy data-handling requirements, what are the key factors drawing them to IPFS, and what are their concerns? We have some initial guesses we'd like to validate in the interviews, and we'd also like to hear from everyone here in case there are factors we haven't considered.

Here are the features I think draw people to IPFS:

  • Content addressing
  • The ability to move, replicate, and re-provide data without compromising its integrity
  • The ability to select a subset of a dataset without copying (or referencing) the whole thing
  • A (content-addressed) foundation for version control that stays independent of version-control metadata structures
  • Support for efficient, on-the-fly aggregation and analysis of data from multiple locations

And here are the factors I think people weigh when evaluating IPFS:

  • Reliability
  • Scalability
  • Security
  • Performance

Anything you'd add?
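The first two bullets above can be sketched concretely: a content-addressed system names data by a hash of its bytes, so a copy fetched from any host can be verified against the address itself. Here is a minimal toy illustration (plain SHA-256 stands in for IPFS's real multihash-based CIDs):

```python
import hashlib

def address(data: bytes) -> str:
    """Toy content address: the SHA-256 of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()

original = b"climate-model-output-v1"
addr = address(original)

# The data can be moved or mirrored anywhere; the address still
# verifies it, because the address *is* a fingerprint of the bytes.
mirror_copy = b"climate-model-output-v1"
assert address(mirror_copy) == addr   # an intact copy verifies

tampered = b"climate-model-output-v2"
assert address(tampered) != addr      # any change breaks verification
print("integrity check passed")
```

This is why moving or re-providing data doesn't threaten its integrity: the name and the bytes cannot drift apart.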

 

tkklein (Tomk):

I'm new to IPFS, so hopefully my comments aren't out of the lane you intended :)

I'd add simplicity and integration to your list of factors. Organizations will have some sort of storage virtualization technology, and they'll want IPFS to plug into that. They'll also have content and data management systems (beyond Hadoop) that theoretically shouldn't need to know about IPFS, though that probably needs to be confirmed. For instance, I'm digging into video (specifically all the bodycam/dashboard-cam data being generated). That data isn't kept for very long today because storage is too expensive. Governments aren't going to hire extra people to work with IPFS if they can't use an existing interface.

 

eocarragain (Eoghan o Carragain):

I'd probably add:

  • Maturity and stability of the specs
  • Degree of adoption and integration (e.g. with browsers and operating systems)

 

jeiros (Juan Eiros):

Just to add to the conversation: I've only recently found out about IPFS, and it seems to me that it could be really positive for reproducibility in science.

In my particular research community, large (up to around 10 TB) binary files are generated through very time-consuming simulations. Storing them appropriately is a big deal (losing files means repeating simulations that can span several months). Sharing them with colleagues is of course also really important, and unfortunately not always doable in practice. For example, I can't download multi-terabyte simulation datasets hosted in Stanford's repository, since I'm based in Europe and it would take an absurdly long time to do so.

From what I've gathered in my short time reading about IPFS, the whole point is to increase file-sharing speed by talking to your nearest neighbours in the network rather than to a central repository. But I've also read that duplication is avoided, and that each node stores only the content it is "interested" in. So in the case I described, how would IPFS decide who stores these large datasets? Wouldn't duplicating them be too costly? If so, we'd be back in the situation I'm in now: downloading a huge dataset from across the globe is infeasible.

I'm interested in hearing from more knowledgeable members of the IPFS community on this.
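To make the "interested nodes" question concrete: an IPFS node keeps only content it has explicitly pinned (or recently fetched), and a distributed hash table maps each content ID to the peers currently providing it, so a reader can fetch from whichever pinned copy is closest. A toy sketch of that idea (the peer names, latency table, and `nearest_provider` helper are all hypothetical, not IPFS's actual routing API):

```python
import hashlib

def cid(data: bytes) -> str:
    """Toy content ID: a truncated hash (real IPFS uses multihash CIDs)."""
    return hashlib.sha256(data).hexdigest()[:16]

# Toy DHT: content ID -> peers that have chosen to pin (store) it.
providers = {}

def pin(peer: str, data: bytes) -> str:
    """A peer opts in to storing some content and advertises it."""
    c = cid(data)
    providers.setdefault(c, set()).add(peer)
    return c

def nearest_provider(c: str, latency_ms: dict) -> str:
    """Pick the advertised provider with the lowest latency to us."""
    return min(providers[c], key=lambda peer: latency_ms[peer])

dataset = b"10TB simulation output (stand-in bytes)"
c = pin("stanford-mirror", dataset)
pin("eu-mirror", dataset)  # a European collaborator pins the same data

# A reader in Europe fetches from whichever pinned copy is closest.
latency = {"stanford-mirror": 150, "eu-mirror": 20}
print(nearest_provider(c, latency))  # -> eu-mirror
```

The sketch also shows the answer to "who stores it": nobody, until some party (the original lab, a mirror, a collaborator) deliberately pins it; IPFS routes readers to those volunteers rather than deciding placement itself.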

 

looccm (Matt McCool):

Most companies store such large workloads on EMC Isilon or NetApp, both of which have limitations on the four factors you listed above. I work on the sales side in storage, and I can say that almost all of my customers are looking to dump large archive workloads to AWS or Azure; that's always the low-hanging fruit. So archive use cases could be an interesting play, especially in industries that generate petabytes of data, like media or research.

 

kehao95 (kehao):

Hi, I work at a web user-behavior analytics company, comparable to Google Analytics. Our tracking code generates several terabytes of data every day, which we store in AWS S3 with an expiration policy so that the total stays within a few hundred terabytes. We're looking for ways to reduce duplication in the stored data so we can save money.

There are millions of sessions per day, which means that once we deploy js-ipfs, there will be millions of short-lived ipfs nodes (lasting seconds to tens of minutes) across the web. I believe that could unlock much of IPFS's potential.

Back to the point: while a user visits a site, we watch and record every DOM change on the page so that we can replay the session later for analysis. Currently we need the following:

  1. Version control, or the Tree Object mentioned in section 3.6.3 of the IPFS white paper. Right now we use a diff algorithm to compute DOM changes and store both the original and the diffs as files. If the IPFS Tree Object works as described, we could eliminate a lot of duplication and save a great deal of space.
  2. A reliable push (or upload) method. I've tried PubSub in a demo, but delivery of the content doesn't seem to be guaranteed yet. Since a tab can be closed at any time, it's very important for us to push data to the backend within microseconds. (There may be some workarounds.)
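The deduplication kehao hopes for in point 1 falls directly out of content addressing: two blocks with identical bytes hash to the same address, so a content-addressed store keeps only one copy no matter how many sessions reference it. A minimal sketch (the block store and chunking here are illustrative, not IPFS's actual object format):

```python
import hashlib

store = {}  # hash -> bytes: a toy content-addressed block store

def put(block: bytes) -> str:
    """Store a block under its own hash; identical blocks dedupe for free."""
    h = hashlib.sha256(block).hexdigest()
    store[h] = block  # writing the same bytes twice overwrites in place
    return h

# Two session recordings share the same initial DOM snapshot
# but have different diffs (all byte strings are stand-ins).
snapshot = b"<html>...initial DOM...</html>"
session_a = [put(snapshot), put(b"diff: clicked #buy")]
session_b = [put(snapshot), put(b"diff: scrolled to footer")]

# Each session is just a list of block hashes; the shared snapshot
# occupies storage only once.
print(len(store))                     # -> 3 blocks, not 4
print(session_a[0] == session_b[0])   # -> True: same bytes, same address
```

A tree object generalizes this: a session becomes a small root node pointing at block hashes, and any blocks common across sessions are stored exactly once.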

 

ChristianKI:

You don't have code complexity on your list. I'd guess that only having to write one software stack for the client, rather than one stack for the client and another for the server, would be an advantage.

 

flyingzumwalt (Matt Zumwalt):

@ChristianKI

One way I tend to think of it is that it lets us treat everything as nodes, services, and workers in one broad system, where location is incidental and changeable based on need; for example, it blurs the distinction between server-side and client-side analysis. Instead of forcing a server-vs-client dichotomy, it lets you think in terms of performing analysis on a device close to the data, on a device further away, or replicating the data to a new location and analyzing it there. In a way this simplifies your code base, because you can write small libraries and services that are reused in client applications, workers, and so on, regardless of where they run.

(For more of the latest IPFS discussions, visit the official IPFS forum at discuss.ipfs.io.)

Original content by Ironyecho for ipfser.org. Please credit the source when republishing: http://ipfser.org/2018/02/01/r22/
