System Design 2.0
By the way, what are the popular data storage methods, based on the nature of the data? I thought it's better to give some idea first, since I'm going to use them.
Blob Store: Unstructured data (not a good fit for a tabular format). What are the solutions for this type of data? GCS and Amazon S3 are two of them. Keep in mind that even though blob stores also work in a key-value manner, there is a real difference: a blob store is good at inserting and accessing massive amounts of data, while a key-value store is optimized for latency rather than availability and durability.
Time Series DB: InfluxDB and Prometheus are used in scenarios where data changes with high frequency, which is very common in time series analysis.
So we're going to pick one system for our design exercise. Let's start by designing Google Drive. So what are you going to design now? Wait... first get clarified. Think about which areas you need to get clear on first. By the way, for the record: "Clients are always correct :)"
Nope, pause and grab some questions.
You may think, okay, I have used Drive, so what now?
Wrong. Does the other party need to include all the features, like file sharing and integrated products like Docs, too? You might guess they're not necessary, but always ask and clarify. Yep, that's how it works when it comes to system design.
Take 30 seconds and think further.
Now assume the following provides answers to all those questions:
- It should be possible to download files and rename them. It should be possible to create new folders. It's just a web application.
- Assume that two clients won't make changes to the same file or folder at the same time (let's not worry about conflicts).
- This system should serve about a billion users and handle 30GB per user on average.
- We need to make sure that once a file is uploaded or a folder is created, it won't disappear until the user deletes it. The system needs to be highly available.
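A quick back-of-the-envelope calculation on those requirements helps set expectations for the storage layer (the numbers are rough estimates derived from the stated figures):

```python
# Rough capacity estimate from the stated requirements.
users = 1_000_000_000            # ~1 billion users
per_user_gb = 30                 # 30 GB per user on average

total_gb = users * per_user_gb   # total raw storage needed
total_pb = total_gb / 1_000_000  # 1 PB = 1,000,000 GB

print(f"~{total_pb:,.0f} PB raw")            # ~30,000 PB (about 30 exabytes)
print(f"~{total_pb * 3:,.0f} PB replicated") # with 3x replication for durability
```

Thirty exabytes (before replication) is why the file contents can't live in a regular database and need a dedicated blob-storage solution.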
So what are the actions that are supposed to happen in this scenario?
Delete, rename, move, upload, and download files and folders. When it comes to folders, we have a creation option too.
We need a storage solution for two types of data:
- File Contents: The contents of the files uploaded to Google Drive. These are opaque bytes with no particular structure or format.
- Entity Info: The metadata for each entity. This might include fields like entityID, ownerID, lastModified, entityName, entityType. This list is non-exhaustive, and we’ll most likely add to it later on.
It's better to have some idea about how sharding and replication work. When we have a main DB, the replicated DBs can be updated at the same time or one after the other, with considerable differences in speed; which approach you take basically depends on the nature of the product you're dealing with. Sharding, on the other hand, is a bit different. Let's look at it at a high level for the purpose of solving the issue we currently have. Imagine you have an employee details table and you're going to distribute its rows based on the name of the user ([A-E] = shard1, [F-G] = shard2, ...). A silly example, but easy to understand.
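That name-range idea can be sketched in a few lines (the letter ranges and shard names here are made up purely for illustration):

```python
def pick_shard(name: str) -> str:
    """Route a row to a shard based on the first letter of the name."""
    first = name[0].upper()
    if "A" <= first <= "E":
        return "shard1"
    if "F" <= first <= "J":
        return "shard2"
    return "shard3"  # everything else

print(pick_shard("Alice"))  # shard1
print(pick_shard("Frank"))  # shard2
```

Note that range-based sharding like this is prone to hot spots (some letters are far more common than others), which is one reason real systems usually shard on a hash instead.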
Let's go step by step.
Storing Our File Entity Information:
Since we need high availability and data replication, we need to use something like etcd, ZooKeeper, or Google Cloud Spanner (as a K-V store) that gives us both of those guarantees as well as consistency.
DynamoDB, by contrast, gives us only eventual consistency by default.
Disadvantage of Sharding:
Sharding on entityID means that we'll lose the ability to perform batch operations, which these key-value stores give us out of the box and which we'll need when we move entities around. For instance, moving a file from one folder to another involves editing the metadata of 3 entities (rename can also fall into this category); if they were located in 3 different shards, that wouldn't be great (unnecessary cross-shard work).
Instead, we can shard based on the ownerID of the entity, which means we can edit the metadata of multiple entities atomically in a transaction, as long as the entities belong to the same user.
It's also possible to have a layer of proxies for entity information, load balanced on a hash of the ownerID. The proxies could do some caching, as well as perform ACL (access control list) checks when we eventually decide to support them.
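A sketch of how such a proxy might pick a shard from the ownerID (the hash function and shard count here are arbitrary choices, not a prescribed design):

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count, for illustration

def shard_for_owner(owner_id: str) -> int:
    """Deterministically map an ownerID to a shard index.

    All entities of one user land on the same shard, so a
    multi-entity edit (e.g. a move) can be one local transaction.
    """
    digest = hashlib.sha256(owner_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same owner always routes to the same shard:
assert shard_for_owner("user-42") == shard_for_owner("user-42")
```

Hashing also spreads owners evenly across shards, avoiding the hot-spot problem of range-based schemes.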
Storing File Data
Blob splitters will have the job of splitting files into blobs and storing these blobs in some global blob-storage solution like GCS or S3. Why do we need this? Don't forget the requirements: large uploads and the sheer amount of data storage we're dealing with.
To make sure of redundancy, let's push each blob to 3 different GCS buckets and consider a write successful only if it went through in at least 2 of them. This way we always have redundancy without necessarily sacrificing availability. The pending third write can be completed later by another async service.
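That "2 out of 3" rule is a quorum write. A minimal sketch, using fake bucket objects as stand-ins for real GCS/S3 clients (the `FakeBucket` class and its `put` method are invented for this demo):

```python
def quorum_write(buckets, key, data, quorum=2):
    """Write a blob to every bucket; succeed if at least `quorum` writes land.

    `buckets` is a list of objects with a .put(key, data) method that
    returns True on success. A background repair service can later
    re-push the blob to any bucket that missed.
    """
    ok = 0
    for bucket in buckets:
        try:
            if bucket.put(key, data):
                ok += 1
        except Exception:
            pass  # treat a failed bucket as a miss and keep going
    return ok >= quorum

class FakeBucket:
    """In-memory stand-in for a GCS/S3 bucket client."""
    def __init__(self, up=True):
        self.up = up
        self.store = {}
    def put(self, key, data):
        if not self.up:
            raise ConnectionError("bucket down")
        self.store[key] = data
        return True

buckets = [FakeBucket(), FakeBucket(), FakeBucket(up=False)]
print(quorum_write(buckets, "blob1", b"bytes"))  # True: 2 of 3 succeeded
```

If two buckets are down, the write fails and the client can retry, so we never acknowledge a blob that isn't safely redundant.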
But what happens when blobs get repeated in storage?
Name the blobs after a hash of their content. Before putting our content in, we check whether something identical is already there. Because of this, these blobs become immutable (a kind of stubborn behavior :)).
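This is content addressing, and it can be sketched in a few lines (the class name and SHA-256 choice are illustrative assumptions, not a mandated design):

```python
import hashlib

class ContentAddressedStore:
    """Name blobs by the hash of their bytes: identical content is
    stored once, and a stored blob never changes (it is immutable)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:   # dedup: skip the write if already stored
            self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentAddressedStore()
k1 = store.put(b"same bytes")
k2 = store.put(b"same bytes")
assert k1 == k2                  # duplicate content gets the same name
assert len(store._blobs) == 1    # and is physically stored only once
```

Immutability falls out naturally: since the name is derived from the content, changed content gets a new name instead of overwriting the old blob.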
But after doing all this, we have to agree on some vocabulary, right? :) We need a format for communication that every entity can be described in.
- Folder Info
children_ids: ['id_of_child_0', 'id_of_child_1'],
Have a look at the format and think as a folder: what are the things you have to deal with? Basically, you can have folders within you, and... you can be inside another one.
- File Info
blobs: ['blob_content_hash_0', 'blob_content_hash_1'],
Sometimes, while iterating, it's handy to have some kind of tag to identify files and folders separately.
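Putting the pieces together, the two entity formats might look something like this. Only the fields mentioned so far (children_ids, blobs, and the earlier metadata list) come from the design; the exact names like `parent_id` and the sample values are illustrative assumptions:

```python
folder_info = {
    "id": "folder_id_0",
    "owner_id": "user_123",
    "name": "Documents",
    "is_folder": True,       # the tag that separates folders from files
    "parent_id": "root",     # a folder can live inside another folder
    "children_ids": ["id_of_child_0", "id_of_child_1"],
}

file_info = {
    "id": "file_id_0",
    "owner_id": "user_123",
    "name": "notes.txt",
    "is_folder": False,
    "parent_id": "folder_id_0",
    "blobs": ["blob_content_hash_0", "blob_content_hash_1"],
}
```

Note that the file entity stores blob content hashes, not the bytes themselves; the bytes live in blob storage.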
Removing Unused Blobs
A Garbage Collection service watches the entity-info K-V stores and keeps counts of the number of times each blob is referenced by files; these counts can be stored in a SQL table.
Reference counts get updated whenever files are uploaded or deleted. When the reference count for a particular blob reaches 0, the blob can be safely removed from blob storage.
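A minimal in-memory sketch of that reference-counting logic (a real service would keep the counts in the SQL table mentioned above rather than a Python `Counter`):

```python
from collections import Counter

class BlobGarbageCollector:
    """Track how many files reference each blob; a blob whose count
    drops to zero is safe to delete from blob storage."""

    def __init__(self):
        self.ref_counts = Counter()   # stand-in for the SQL table

    def on_file_uploaded(self, blob_hashes):
        for h in blob_hashes:
            self.ref_counts[h] += 1

    def on_file_deleted(self, blob_hashes):
        orphans = []
        for h in blob_hashes:
            self.ref_counts[h] -= 1
            if self.ref_counts[h] == 0:
                del self.ref_counts[h]
                orphans.append(h)     # schedule these for deletion
        return orphans

gc = BlobGarbageCollector()
gc.on_file_uploaded(["h1", "h2"])
gc.on_file_uploaded(["h2"])              # h2 is now shared by two files
print(gc.on_file_deleted(["h1", "h2"]))  # ['h1']; h2 still has a reference
```

Deduplicated blobs are exactly why plain "delete the file's blobs" doesn't work: a blob may still be referenced by another file.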
According to the plan:
CreateFolder: since folders don’t have a blob-storage component, creating a folder just involves storing some metadata in our key-value stores.
UploadFile works as follows: first we store the blobs that make up the file in blob storage. Once the blobs are persisted, we create the file-info object, store the blob-content hashes inside its blobs field, and write this metadata to our key-value stores.
DownloadFile fetches the file's metadata from our key-value stores given the file's ID. The metadata contains the hashes of all of the blobs that make up the content of the file (remember the format we shared; have a look :)), which we can use to fetch all of the blobs from blob storage.
We can then assemble them into the file and save it onto local disk (if we used some other mechanism when storing the blobs, we have to reverse those steps here).
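The upload and download paths above can be sketched end to end with in-memory stand-ins for blob storage and the K-V store (the chunk size and function names are arbitrary choices for the demo):

```python
import hashlib

blob_storage = {}  # content hash -> bytes (stand-in for GCS/S3)
kv_store = {}      # file id -> metadata  (stand-in for the K-V stores)

CHUNK = 4          # tiny chunk size, just for the demo

def upload_file(file_id, owner_id, name, content: bytes):
    # 1) Split into blobs and persist them first.
    hashes = []
    for i in range(0, len(content), CHUNK):
        blob = content[i:i + CHUNK]
        h = hashlib.sha256(blob).hexdigest()
        blob_storage[h] = blob
        hashes.append(h)
    # 2) Only once the blobs are safe, write the file-info metadata.
    kv_store[file_id] = {"id": file_id, "owner_id": owner_id,
                         "name": name, "is_folder": False, "blobs": hashes}

def download_file(file_id) -> bytes:
    meta = kv_store[file_id]  # fetch metadata by file ID
    # Fetch each blob by its hash and reassemble the original bytes.
    return b"".join(blob_storage[h] for h in meta["blobs"])

upload_file("f1", "user_123", "notes.txt", b"hello drive!")
assert download_file("f1") == b"hello drive!"
```

The ordering in upload_file matters: writing metadata before the blobs are persisted could leave a file whose content can't be fetched.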
The move, rename, and delete operations are all handled by changing metadata only.
So basically, what I have done is take the theory and explain it as simply as I can. There may be a thousand possible combinations, but the point is arriving at a solution with what we've got.