- The Distributed Wikipedia Mirror and Kiwix projects are happy to announce the general availability of updated English and Turkish mirrors, along with new languages: Myanmar, Arabic, Chinese, and Russian.
- A handy, up-to-date list can be found at ipfs.kiwix.org and in the `snapshot-hashes.yml` manifest.
- The idea of a distributed Wikipedia mirror goes back to 2017, when the IPFS Project created snapshots of the English and Turkish Wikipedias and put them on IPFS. To learn why we did it, please read the original Uncensorable Wikipedia on IPFS post.
- Below is a short status update with improved usage instructions, current build process, open problems, and future work that could be contributed to the project.
# Improved access to Wikipedia mirrors
# User-friendly ipns://{dnslink} and public gateways
Browsers with built-in support for IPFS addresses (Brave, Opera, or regular Firefox or Chromium with IPFS Companion) can now load the latest snapshot using DNSLink:
ipns://{dnslink}
ipns://en.wikipedia-on-ipfs.org
To ensure true P2P transport, offline storage, and content integrity, you can run your own IPFS node (command-line or the IPFS Desktop app) combined with the IPFS Companion browser extension. You can also use the Brave browser, which has built-in support for IPFS.
When it is not possible to run your own IPFS node, one of the many public gateways can be used as a proxy for accessing the mirror. For example:
- https://dweb.link/ipns/my.wikipedia-on-ipfs.org
- https://cf-ipfs.com/ipns/my.wikipedia-on-ipfs.org
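A gateway address is simply the gateway host followed by `/ipns/` and the DNSLink name of the mirror. The sketch below makes that pattern explicit; `gateway_url` is a hypothetical helper name, and the hosts are the two gateways listed above.

```shell
# gateway_url is a hypothetical helper, not part of any IPFS tool.
# $1: public gateway host, $2: DNSLink name of the mirror
gateway_url() {
  printf 'https://%s/ipns/%s\n' "$1" "$2"
}

# Print the Myanmar mirror address on each gateway mentioned in this post.
for gateway in dweb.link cf-ipfs.com; do
  gateway_url "$gateway" my.wikipedia-on-ipfs.org
done
```

Keep in mind that a gateway sits between you and the content, so this route trades away the integrity guarantees of a local node.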
# Robust and immutable ipfs://{cid}
If DNS resolution is blocked, or a public gateway can't be trusted, accessing the immutable snapshot using the underlying cryptographic content identifier (CID) is advised:
ipfs://{cid}
The `{cid}` of a specific mirror can be found in `snapshot-hashes.yml`, or read from its DNSLink record with `ipfs resolve -r /ipns/en.wikipedia-on-ipfs.org`. At the time of writing this post, the English mirror points at `ipfs://bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze`.
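The DNSLink lookup can be wrapped in a small shell function that turns a mirror name into the immutable address it currently points at. This is a minimal sketch that assumes `ipfs resolve -r` prints a path of the form `/ipfs/<cid>`; the function names `cid_from_ipfs_path` and `resolve_mirror` are hypothetical.

```shell
# Strip the leading "/ipfs/" from a resolved content path, leaving the bare CID.
cid_from_ipfs_path() {
  printf '%s\n' "${1#/ipfs/}"
}

# Resolve a DNSLink name (first argument) to the ipfs:// address behind it.
# Requires a working `ipfs` command on the PATH.
resolve_mirror() {
  path="$(ipfs resolve -r "/ipns/$1")" || return 1
  printf 'ipfs://%s\n' "$(cid_from_ipfs_path "$path")"
}
```

Running `resolve_mirror en.wikipedia-on-ipfs.org` should print an address you can write down and share even when DNS is later blocked.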
Sharing CIDs via sneakernet is a popular way of routing around DNS issues and censorship. Turkish citizens resorted to that in 2017 when Turkey blocked Wikipedia. History does not repeat itself, but it rhymes. Myanmar started experiencing internet blackouts earlier this year:
Confirmed: #Myanmar has blocked all language editions of the Wikipedia online encyclopedia, part of a widening post-coup internet censorship regime imposed by the military junta 📚
Network data show restriction in effect on major providers.
📰 Report: https://t.co/Jgc20OBk27 pic.twitter.com/qstGEefO4E
(NetBlocks, @netblocks, February 19, 2021)
To address this critical need, we created a mirror of Myanmar Wikipedia and shared both DNSLink and CID.
In response to ongoing internet restrictions / censorship in Myanmar, @Wikipedia in MY is now on @IPFS: https://t.co/trt0AbEMuW
Huge props to @playingwithsid who proposed it, & coordinated w/ native speakers.
Epic implementation effort by @lidelOrg & Kelson of @KiwixOffline!
(dietrich, @dietrich, February 25, 2021)
# How to help co-host this?
You can run your own IPFS node and co-host a subset of Wikipedia, store a full copy, or even follow a collaborative cluster to pull in future updates automatically.
It is also possible to donate co-hosting costs by pinning a specific CID to a remote pinning service.
# Lazy co-hosting with your own IPFS node
It is possible to keep a lazy-loaded copy, which does not fetch the entire Wikipedia but keeps the browsed subset of pages around:
$ ipfs files cp /ipfs/{cid} /my-wikipedia-snapshot
One can convert a lazy copy to a full one by recursively pinning the DAG behind a CID:
$ ipfs pin add --progress {cid}
A recursive pin will preload the entire mirror to the local datastore.
Be aware that the English mirror is far bigger than the others: pinning it requires hundreds of gigabytes and may take a very long time.
The size of a specific mirror can be read with `ipfs files stat /ipfs/{cid}`.
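The commands in this section can be combined into one function that checks the size first, keeps a lazy copy, and only pins recursively when asked. This is a sketch rather than an official workflow: `cohost_mirror` is a hypothetical name, and the MFS path `/my-wikipedia-snapshot` is the one used in the example above.

```shell
# cohost_mirror is a hypothetical wrapper around the commands from this section.
# $1: snapshot CID, $2: "lazy" (default) or "full"
cohost_mirror() {
  cid="$1"
  mode="${2:-lazy}"
  # Show the size of the mirror before committing any disk space.
  ipfs files stat "/ipfs/$cid" || return 1
  # Lazy copy: reference the snapshot in MFS without fetching all of it.
  ipfs files cp "/ipfs/$cid" /my-wikipedia-snapshot || return 1
  # Full copy: recursively pin, which preloads the entire DAG.
  if [ "$mode" = "full" ]; then
    ipfs pin add --progress "$cid"
  fi
}
```

Calling `cohost_mirror {cid}` keeps the lazy behaviour, while `cohost_mirror {cid} full` also preloads everything, so reserve the second form for mirrors whose reported size fits on your disk.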
# Collaborative cluster
This is an advanced option aimed at server administrators and power users. The `wikipedia` cluster includes all language versions, and its size only grows over time.
$ ipfs-cluster-follow wikipedia run --init wikipedia.collab.ipfscluster.io
See the instructions at collab.ipfscluster.io.
# Donate remote pins
When co-hosting with your own IPFS node is not possible, one can still help by pinning snapshot CIDs to a remote pinning service.
Learn how to work with remote pinning services.
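With a recent IPFS command-line client, remote pinning is exposed through the `ipfs pin remote` commands. The sketch below assumes your client ships them; the function names are hypothetical, and the service nickname, API endpoint, and secret key are placeholders that come from your pinning provider.

```shell
# Register a pinning service once. All three arguments are placeholders:
# $1: nickname of your choice, $2: API endpoint, $3: secret key from the provider
register_pinning_service() {
  ipfs pin remote service add "$1" "$2" "$3"
}

# Ask the registered service to pin a snapshot CID under a readable name.
# $1: service nickname, $2: snapshot CID, $3: label shown in the service's pin list
donate_remote_pin() {
  ipfs pin remote add --service="$1" --name="$3" "$2"
}
```

Because the service fetches the data itself, this works even from a machine that could never hold a full mirror.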
# How is a mirror built?
The current setup relies on Wikipedia snapshots in the ZIM format produced by the Kiwix project.
We don't have a web-based reader of ZIM archives (yet – more in the next section), and the way we produce a mirror is an elaborate, time-consuming process:
- Unpack the ZIM archive with openzim/zim-tools
- Adjust HTML/CSS/JS to fix up the unpacked form
- Import the snapshot to IPFS
- Include the original ZIM inside the unpacked IPFS snapshot
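The steps listed above can be approximated in a few lines of shell. This is a simplified sketch, not the project's actual build script: `build_mirror` is a hypothetical name, the `zimdump dump --dir` invocation reflects recent zim-tools releases and may differ in yours, the HTML/CSS/JS fixups are omitted, and the original ZIM is copied into the directory before import rather than attached afterwards.

```shell
# build_mirror is a hypothetical, simplified stand-in for the real build process.
# $1: path to the ZIM archive, $2: directory to unpack into
build_mirror() {
  zim="$1"
  out="$2"
  # Unpack the ZIM archive into a plain directory tree.
  zimdump dump --dir="$out" "$zim" || return 1
  # Project-specific HTML/CSS/JS fixups would run here (not shown).
  # Keep the original ZIM next to the unpacked pages (assumption: done before import).
  cp "$zim" "$out/" || return 1
  # Import the directory and print only the root CID.
  ipfs add -r -Q --cid-version 1 "$out"
}
```

Even in this reduced form, the snapshot is written to disk twice (unpacked pages plus the original archive), which illustrates the duplication the process suffers from.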
While this works, the need for unpacking and customizing the snapshot makes it difficult to reliably produce updates. Including the original ZIM for use with the Kiwix offline reader also partially duplicates the data.
We would love to mirror more languages, and increase the update cadence, but for that to happen we need to remove the need for unpacking ZIM archives.
We will be looking into putting all ZIMs from Kiwix on IPFS and archiving them for long-term storage on Filecoin as part of the farm.openzim.org pipeline.
# Help Wanted and Open Problems
If you are still reading this, there is a high chance you are interested in improving the way the distributed Wikipedia mirror works.
Below are areas that could use a helping hand, and ideas looking for someone to explore them.
- Search. There's no search function currently. Leveraging the index present in ZIM, or building a DAG-based search index optimized for use in web browsers, would make existing mirrors more useful. See distributed-wikipedia-mirror/issues/76.
- Web-based ZIM reader. The biggest impact for the project would be to create a web-based reader capable of browsing original ZIM archives without unpacking them or installing any dedicated software. Want to help make it a reality? See kiwix-js/issues/659.
- Improving the way ZIM is represented on IPFS. When we store an original ZIM on IPFS, the DAG is produced by `ipfs add --cid-version 1`. This works fine, but with additional research on customizing DAG creation, we may improve deduplication and speed when doing range requests for specific bytes. There are different stages to explore here. If any of them sounds interesting to you, please comment in distributed-wikipedia-mirror/issues/42.
  - Stage 1: Invest some time in benchmarking the parameter space to see if any low-hanging fruit exists.
  - Stage 2: Create a DAG builder that understands the ZIM format and maximizes deduplication of image assets by representing them as sub-DAGs with dag-pb files.
  - Stage 3: Research augmenting or replacing ZIM with IPLD. How can we maximize block deduplication across all snapshots and languages? How would an IPLD-based search index work?
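A first pass at Stage 1 might simply loop over a few `--chunker` settings of `ipfs add` and record the resulting root CID and elapsed time for each. The sketch below is only a starting point: `benchmark_chunkers` is a hypothetical name, the chunker values are arbitrary picks, and `--only-hash` skips writing blocks, so it measures chunking and hashing but says nothing about deduplication on disk.

```shell
# benchmark_chunkers is a hypothetical helper; the chunker values are arbitrary examples.
# $1: path to a ZIM archive
benchmark_chunkers() {
  zim="$1"
  for chunker in size-262144 size-1048576 rabin-131072-262144-524288; do
    start="$(date +%s)"
    # --only-hash builds the DAG in memory without storing any blocks.
    cid="$(ipfs add -Q --only-hash --cid-version 1 --chunker="$chunker" "$zim")" || return 1
    end="$(date +%s)"
    # One tab-separated line per setting: chunker, root CID, seconds elapsed.
    printf '%s\t%s\t%ss\n' "$chunker" "$cid" "$((end - start))"
  done
}
```

To compare deduplication, the same loop would need to add the data for real and compare repository sizes between runs, which takes far longer on a full-size archive.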