Git Mailing List Archive mirror
 help / color / mirror / Atom feed
* [GSoC][RFC][PROPOSAL] More Sparse Index Integrations
@ 2023-03-28  1:49 Shuqi Liang
  2023-03-28  2:12 ` Shuqi Liang
  2023-03-31 19:35 ` [GSoC][PROPOSAL v2] " Shuqi Liang
  0 siblings, 2 replies; 4+ messages in thread
From: Shuqi Liang @ 2023-03-28  1:49 UTC (permalink / raw
  To: git; +Cc: Shuqi Liang, vdye, gitster, derrickstolee

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 12464 bytes --]

Hello everyone,

Here is the initial version of my proposal on "More Sparse Index Integrations". I would really love to have comments on it. If you
find any mistakes please notify me about that.

And website version is there: https://docs.google.com/document/d/1WtoLgAJYVHY_NWNscqi358FeAY-6UBARCmpTU3BAQMs/edit#

==================================================================

## More Sparse Index Integrations
Personal Details

### About Me
| Name | Shuqi Liang |

| Mobile no. |  +1 (416) 272-1737|
| Email | cheskaqiqi@gmail.com|

| Github | https://github.com/Cheskaqiqi |
| Major | Computer Science |
| Time Zone | EDT (UTC  -4 hours) |
    

### Me & Git

I am a Computer Science undergraduate at Western University (Canada). I have learned C, Python, Java, and shell in the last two years. I have started exploring the Git codebase since Jan 2023.Here is some related documentation I have read:


# MyFirstContribution.txt   
# SubmittingPatches

*Way to make contributions to the git project.

# MyFirstObjectWalk.txt

*Way uses git's Object Walk API to traversethe git object database.
Topics such as object types and IDs.


# Hacking Git
# Understanding Git — Index 
# Git Internals
# UnderstandingGit  DataModel
# UnderstandingGit Branching
# Underlying data model of git and how it works.

*Types of git objects: blobs, trees, and commits.
*The concept of branching in Git and how it is used to manage code changes.
*Two layers of git: the low-level plumbing layer and the high-level  porcelain layer.
*Relationship between the Index and the working directory.


# Make your monorepo feel small with 
# Bring your monorepo down to size with sparse-checkout 

*How the git sparse checkout and git sparse index feature can help make large monorepos feel smaller and more manageable

# Sparse-checkout.txt
# Sparse-index.txt

I would like to thank Shaoxuan, the 2022 GSoC who recommended these two articles to me. The articles cover some of the limitations and potential issues with ‘sparse-checkout’ and ‘sparse-index’.


###Contributions


#Status 
wip
#Subject
[Microproject]: t4113: modernize test script
#Mail List Link
https://lore.kernel.org/git/20230215023953.11880-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
[Aided a potential GSoC student]: trace.c, git.c: removed unnecessary parameter to trace_repo_setup
#Mail List Link
https://lore.kernel.org/git/20230215175510.3631-1-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
check-attr: integrate with sparse-index
#Mail List Link
https://lore.kernel.org/git/20230227050543.218294-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
diff-files: integrate with sparse index
#Mail List Link
https://lore.kernel.org/git/20230322161820.3609-2-cheskaqiqi@gmail.com/




###The Project: More Sparse Index Integrations


#What's "sparse-checkout"

When the repository has so many files at root, it causes git commands to slow to a crawl (e.g., git checkout, git status). Now, ‘sparse-checkout’ allows users to restrict their working directory to only the files they care about. It is supposed to make users feel like they are in a small repository, even though they are contributing to a large one.[1]  If users use the "microservices in a monorepo" pattern, "sparse-checkout" can ensure the developer workflow is as fast as possible while maintaining all the benefits of a monorepo.[1]



#What's ‘git index’

In git, the index, or the staging area, is an intermediate area where changes to a Git repository are prepared before committing to the repository. The index file stores a list of every file at HEAD. This list of files is stored as a flat list.[2]




#What's ‘sparse-index’

Although ‘sparse-checkout’  has done very well, it still has a problem: the Git index is still large in a monorepo.
‘sparse-index’ allows the index to focus on the files within the sparse-checkout cone. The size of the sparse index will scale with the number of files within users' chosen directories instead of the full size of the repository. When enabled with a number of other performance features, this can have a dramatic performance improvement.[2]



#Problem when integrated with sparse index

The idea of 'sparse index' is easy to understand, but pruning the index at the directory level may cause a complicated result. That is because, In the Git codebase, numerous places directly interact with the index in various nuanced ways. All of them assume that each index entry refers to a blob object. But sparse-directory entries violate expectations about the index format and its in-memory data structure. Many consumers in the codebase expect to iterate through all of the index entries and see only files. Compatibility layers are made to expand a sparse index to an equivalent full index. So that even if the on-disk format is sparse, code paths that have not been integrated and tested with a sparse index can still be used.[2]

The Git Fundamentals team first started by creating the ensure_full_index() method, which converts a sparse index into a full one. 

But this method takes longer to expand a sparse index to a full one than to read a full one from a disk. It goes against our idea of utilizing 'sparse index' to enhance user experience. To gain the performance benefits of a sparse index and improve user experience, we need to optimize the compatibility and teach git to only expand the sparse directory entries only when needed. 

###Plan

Every integration will have similar steps, but the actual steps of commands integrated for the project will vary based on the complexity of the commands chosen:

(Notice that step 1 is from ShaoXuan's GSoC 2022 Git Contributor Proposal, and steps 2,4-7 are from GSoC 2023 Ideas. I made some additions to the steps.)

1. Investigation around a Git command and see if it behaves correctly with sparse-checkout. Modify the Git command's logic to work better with 'sparse-checkout'. Add corresponding tests.[3]

     *Step 1 does not often occur. But it is important to ensure  *"sparse-checkout" is compatible with the Git command to continue *with  the next step.

2. Add tests to t1092-sparse-checkout-compatibility.sh for the built-in, with a focus on what happens for paths outside of the sparse-checkout cone.


    t1092-sparse-checkout-compatibility.sh create a repository with some data shapes in it. Each test case starts by copying that repository into three new repositories with different configurations. These three are called   'full-checkout'  , 'Sparse-checkout' , and 'Sparse-index', respectively :

    'full-checkout' is the same as the repository, without sparse-checkout.

    'Sparse-checkout' with cone mode sparse-checkout enabled.

    'Sparse-index'  with cone mode sparse-checkout and sparse index       enabled.         

    Add tests for the git command we want to integrate with sparse-index to t1092  without the code change. Focus on the git command, which has the ability to affect full-checkout/sparse-checkout/sparse-index differently. The test case should run against all three repositories and have the same output and it should also work when the index is expanded. After integrating the git command with sparse-index, the output and behavior should remain the same.  


3. Add performance tests, so we have a baseline to measure how
   well the git command does.

    We need a baseline to measure the speed before integrating the git command with sparse-index. Normally we will notice the speed is quite slow caused by expanding the index.



4. Disable the command_requires_full_index setting in the built-in and ensure the tests pass.

5. If the tests do not pass, then alter the logic to work with the sparse index.

    Make the code change to only expand the sparse directory entries only when needed.command_requires_full_index setting guards all index reads to require a full index over a sparse index.

    After suitable guards are placed in the codebase around uses of the index, remove the setting. 

6. Add tests to check that a sparse index stays sparse.

    Add  ensure_not_expanded test to t1092-sparse-checkout-compatibility.sh,
    We expect the index to be expanded for out-of-cone moves. But we need to ensure the index will not expand for in-cone moves.

7. Run performance tests to demonstrate speedup.


###Project Timeline


#Empty Period (Present - May 4)

    My end-semester exams begin on April 4 to  April 28. Hence I might be a bit busy in this period.

    After April 28, I will continue to work on 'git diff-files' and start to work on ' git describe '

#Community Bonding Period (May 4- May 28) 

    Get familiar with the community

    I have read the related documentation about 'Sparse Index Integrations' and working on 'git diff-files' , one of the builtins that could be integrated with the sparse index. The feedback and the advice for improvement make me learn a lot. And I'm confident I can start the project at the start of this period.

    Keep working on "git diff-files' and  'git describe' on May 5, and the expected time of completion of these two is May 28.


#Phase 1 (May 29 -July 9)

    week 1 to week 3  (May 29-June 18): integrate  'git write-tree' with sparse index. Use the steps above.

    week 4 to week 6(June 19 -July 9 ): integrate  'git diff-index' with sparse index,use steps above.


#Phase 2 (July 9 - August 28)

    week 1 to week 3  (July 9 - July 23): integrate  'git diff-tree' with sparse index. Use the steps above.

    week 4 to week 6 (July 23 - August 13 ): integrate 'git worktree' with sparse index. Use the steps above.

    week 7(August 13-August 28) :
    A buffer of one week has been kept for any covert difficulties when integrated with the sparse index.

###Blogging about Git

When I was a freshman, I hated writing summaries or other learning materials. But then I started writing blogs for new knowledge to keep track of what I've learned. I realized that 
When I  dive into a topic and want to write it down, I  will think much deeper about it than just learning. I also learned a lot and gained many skills from others' blogs.
I would love to write about my progress and experiences with the project.
In this way, I could share the ideas with those interested in researching this project and help them get up to speed more quickly.


###Availability

My semester will complete in late April, leaving me enough time to prepare for my GSoC project. Git is the only project for my summertime. If I am selected, I shall be able to work five days per week, 7 - 8 hours per day, around 35 -40 hrs a week on the project, though I am open to putting in more effort if the work requires. If everything is going well with the plan, I may want to Participate in a hackathon for a few days with my friends in July.


###Post GSoC


I have received much support from many members of the Git community in recent months. This support has strengthened my passion for git and inspired me to contribute more of my code to the community. Just as others have helped me, I will pay it forward by assisting and encouraging new community members. I believe that sharing knowledge and collaborating with others is the key to creating great software and achieving success in the open-source world.

I am committed to delivering high-quality work and meeting the expectations of the community. I am eager to learn from experienced community members and gain new skills and knowledge that will help me become a more valuable contributor. I am eager to continue working with the Git community beyond the scope of the GSOC program. I believe I have much to offer the community, and I am committed to contributing to its success for a long time.

I hope that the Git community will consider my application for the GSOC program. It would be an honor to be able to contribute to such a fantastic open-source project and work with such a supportive and welcoming community. Thank you for your time and consideration.



###References

[1] Bring your monorepo down to size with sparse-checkout. https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/

[2] Make your monorepo feel small with Git’s sparse index. https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/


[3]Step 1 in the plan is come from  shaoxuan’s proposal in 2022
https://docs.google.com/document/d/1VZq0XGl-wCxECxJamKN9PGVPXaBhszQWV1jaIIvlFjE/edit

Thanks & Regards,
Shuqi




^ permalink raw reply	[flat|nested] 4+ messages in thread

* [GSoC][RFC][PROPOSAL] More Sparse Index Integrations
  2023-03-28  1:49 [GSoC][RFC][PROPOSAL] More Sparse Index Integrations Shuqi Liang
@ 2023-03-28  2:12 ` Shuqi Liang
  2023-03-28 17:16   ` Victoria Dye
  2023-03-31 19:35 ` [GSoC][PROPOSAL v2] " Shuqi Liang
  1 sibling, 1 reply; 4+ messages in thread
From: Shuqi Liang @ 2023-03-28  2:12 UTC (permalink / raw
  To: git; +Cc: Shuqi Liang, vdye, gitster, derrickstolee

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 12706 bytes --]


Hello everyone,

Sorry about the wrapping mistake in last patch.

Here is the initial version of my proposal on "More Sparse Index 
Integrations". I would really love to have comments on it. If you find any 
mistakes please notify me about that.

And website version is there: 
https://docs.google.com/document/d/1WtoLgAJYVHY_NWNscqi358FeAY-6UBARCmpTU3BAQMs/edit#

==================================================================

## More Sparse Index Integrations
Personal Details

### About Me
| Name | Shuqi Liang |

| Mobile no. |  +1 (416) 272-1737|
| Email | cheskaqiqi@gmail.com|

| Github | https://github.com/Cheskaqiqi |
| Major | Computer Science |
| Time Zone | EDT (UTC  -4 hours) |
    

### Me & Git

I am a Computer Science undergraduate at Western University (Canada). I have 
learned C, Python, Java, and shell in the last two years. I have started 
exploring the Git codebase since Jan 2023.Here is some related documentation 
I have read:


# MyFirstContribution.txt   
# SubmittingPatches

*Way to make contributions to the git project.

# MyFirstObjectWalk.txt

*Way uses git's Object Walk API to traversethe git object database.
Topics such as object types and IDs.


# Hacking Git
# Understanding Git — Index 
# Git Internals
# UnderstandingGit  DataModel
# UnderstandingGit Branching
# Underlying data model of git and how it works.

*Types of git objects: blobs, trees, and commits.
*The concept of branching in Git and how it is used to manage code changes.
*Two layers of git: the low-level plumbing layer and the high-level 
porcelain layer.
*Relationship between the Index and the working directory.

# Make your monorepo feel small with 
# Bring your monorepo down to size with sparse-checkout 

*How the git sparse checkout and git sparse index feature can help make 
large monorepos feel smaller and more manageable

# Sparse-checkout.txt
# Sparse-index.txt

I would like to thank Shaoxuan, the 2022 GSoC who recommended these two 
articles to me. The articles cover some of the limitations and potential 
issues with ‘sparse-checkout’ and ‘sparse-index’.


###Contributions


#Status 
wip
#Subject
[Microproject]: t4113: modernize test script
#Mail List Link
https://lore.kernel.org/git/20230215023953.11880-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
[Aided a potential GSoC student]: trace.c, git.c: removed unnecessary 
parameter to trace_repo_setup
#Mail List Link
https://lore.kernel.org/git/20230215175510.3631-1-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
check-attr: integrate with sparse-index
#Mail List Link
https://lore.kernel.org/git/20230227050543.218294-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
diff-files: integrate with sparse index
#Mail List Link
https://lore.kernel.org/git/20230322161820.3609-2-cheskaqiqi@gmail.com/




###The Project: More Sparse Index Integrations


#What's "sparse-checkout"

When the repository has so many files at root, it causes git commands to 
slow to a crawl (e.g., git checkout, git status). Now, ‘sparse-checkout’ 
allows users to restrict their working directory to only the files they care 
about. It is supposed to make users feel like they are in a small 
repository, even though they are contributing to a large one.[1]  
If users use the "microservices in a monorepo" pattern, "sparse-checkout" 
can ensure the developer workflow is as fast as possible while maintaining 
all the benefits of a monorepo.[1]



#What's ‘git index’

In git, the index, or the staging area, is an intermediate area where 
changes to a Git repository are prepared before committing to the 
repository. The index file stores a list of every file at HEAD. This list of 
files is stored as a flat list.[2]



#What's ‘sparse-index’

Although ‘sparse-checkout’  has done very well, it still has a problem: 
the Git index is still large in a monorepo.
‘sparse-index’ allows the index to focus on the files within the 
sparse-checkout cone. The size of the sparse index will scale with the number 
of files within users' chosen directories instead of the full size of the 
repository. When enabled with a number of other performance features, this 
can have a dramatic performance improvement.[2]



#Problem when integrated with sparse index

The idea of 'sparse index' is easy to understand, but pruning the index at 
the directory level may cause a complicated result. That is because, In the 
Git codebase, numerous places directly interact with the index in various 
nuanced ways. All of them assume that each index entry refers to a blob 
object. But sparse-directory entries violate expectations about the index 
format and its in-memory data structure. Many consumers in the codebase 
expect to iterate through all of the index entries and see only files. 
Compatibility layers are made to expand a sparse index to an equivalent full
 index. So that even if the on-disk format is sparse, code paths that have 
 not been integrated and tested with a sparse index can still be used.[2]

The Git Fundamentals team first started by creating the ensure_full_index() 
method, which converts a sparse index into a full one. 

But this method takes longer to expand a sparse index to a full one than to 
read a full one from a disk. It goes against our idea of utilizing 
'sparse index' to enhance user experience. To gain the performance benefits 
of a sparse index and improve user experience, we need to optimize the 
compatibility and teach git to only expand the sparse directory entries 
only when needed. 

###Plan

Every integration will have similar steps, but the actual steps of commands
 integrated for the project will vary based on the complexity of the commands 
 chosen:

(Notice that step 1 is from ShaoXuan's GSoC 2022 Git Contributor Proposal, 
and steps 2,4-7 are from GSoC 2023 Ideas. I made some additions to the steps.)

1. Investigation around a Git command and see if it behaves correctly with 
sparse-checkout. Modify the Git command's logic to work better with 
'sparse-checkout'. Add corresponding tests.[3]

    *Step 1 does not often occur. But it is important to ensure 
    *"sparse-checkout" is compatible with the Git command to continue 
    *with  the next step.

2. Add tests to t1092-sparse-checkout-compatibility.sh for the built-in, 
with a focus on what happens for paths outside of the sparse-checkout cone.


    t1092-sparse-checkout-compatibility.sh create a repository with some 
    data shapes in it. Each test case starts by copying that repository into 
    three new repositories with different configurations. These three are called 
    'full-checkout'  , 'Sparse-checkout' , and 'Sparse-index', respectively :

    'full-checkout' is the same as the repository, without sparse-checkout.

    'Sparse-checkout' with cone mode sparse-checkout enabled.

    'Sparse-index'  with cone mode sparse-checkout and sparse index enabled. 

    Add tests for the git command we want to integrate with sparse-index to 
    t1092 without the code change. Focus on the git command, which has the 
    ability to affect full-checkout/sparse-checkout/sparse-index differently.
    The test case should run against all three repositories and have the same 
    output and it should also work when the index is expanded. After integrating 
    the git command with sparse-index, the output and behavior should remain 
    the same.  


3. Add performance tests, so we have a baseline to measure how
   well the git command does.

    We need a baseline to measure the speed before integrating the git command 
    with sparse-index. Normally we will notice the speed is quite slow caused
     by expanding the index.



4. Disable the command_requires_full_index setting in the built-in and 
ensure the tests pass.

5. If the tests do not pass, then alter the logic to work with the 
sparse index.

    Make the code change to only expand the sparse directory entries only 
    when needed.command_requires_full_index setting guards all index reads 
    to require a full index over a sparse index.

    After suitable guards are placed in the codebase around uses of the index, 
    remove the setting. 

6. Add tests to check that a sparse index stays sparse.

    Add  ensure_not_expanded test to t1092-sparse-checkout-compatibility.sh,
    We expect the index to be expanded for out-of-cone moves. But we need to 
    ensure the index will not expand for in-cone moves.

7. Run performance tests to demonstrate speedup.


###Project Timeline


#Empty Period (Present - May 4)

    My end-semester exams begin on April 4 to  April 28. Hence I might be 
    a bit busy in this period.

    After April 28, I will continue to work on 'git diff-files' and start 
    to work on ' git describe '

#Community Bonding Period (May 4- May 28) 

    Get familiar with the community

    I have read the related documentation about 'Sparse Index Integrations'
    and working on 'git diff-files' , one of the builtins that could be 
    integrated with the sparse index. The feedback and the advice for 
    improvement make me learn a lot. And I'm confident I can start the project 
    at the start of this period.

    Keep working on "git diff-files' and  'git describe' on May 5, and the 
    expected time of completion of these two is May 28.


#Phase 1 (May 29 -July 9)

    week 1 to week 3  (May 29-June 18): integrate  'git write-tree' with 
    sparse index. Use the steps above.

    week 4 to week 6(June 19 -July 9 ): integrate  'git diff-index' with 
    sparse index,use steps above.

#Phase 2 (July 9 - August 28)

    week 1 to week 3  (July 9 - July 23): integrate  'git diff-tree' with 
    sparse index. Use the steps above.

    week 4 to week 6 (July 23 - August 13 ): integrate 'git worktree' with 
    sparse index. Use the steps above.

    week 7(August 13-August 28) :
    A buffer of one week has been kept for any covert difficulties when 
    integrated with the sparse index.

###Blogging about Git

When I was a freshman, I hated writing summaries or other learning materials.
But then I started writing blogs for new knowledge to keep track of what 
I've learned. I realized that When I  dive into a topic and want to write it 
down, I  will think much deeper about it than just learning. I also learned a 
lot and gained many skills from others' blogs. I would love to write about my 
progress and experiences with the project. In this way, I could share the ideas 
with those interested in researching this project and help them get up to speed more quickly.



###Availability

My semester will complete in late April, leaving me enough time to prepare 
for my GSoC project. Git is the only project for my summertime. If I am 
selected, I shall be able to work five days per week, 7 - 8 hours per day, 
around 35 -40 hrs a week on the project, though I am open to putting in more 
effort if the work requires. If everything is going well with the plan, I may 
want to Participate in a hackathon for a few days with my friends in July.


###Post GSoC


I have received much support from many members of the Git community in recent 
months. This support has strengthened my passion for git and inspired me to 
contribute more of my code to the community. Just as others have helped me, 
I will pay it forward by assisting and encouraging new community members. I 
believe that sharing knowledge and collaborating with others is the key to 
creating great software and achieving success in the open-source world.

I am committed to delivering high-quality work and meeting the expectations 
of the community. I am eager to learn from experienced community members and 
gain new skills and knowledge that will help me become a more valuable 
contributor. I am eager to continue working with the Git community beyond the
scope of the GSOC program. I believe I have much to offer the community, 
and I am committed to contributing to its success for a long time.

I hope that the Git community will consider my application for the GSOC 
program. It would be an honor to be able to contribute to such a fantastic 
open-source project and work with such a supportive and welcoming community. 
Thank you for your time and consideration.



###References

[1] Bring your monorepo down to size with sparse-checkout.
 https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/

[2] Make your monorepo feel small with Git’s sparse index. 
https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/


[3]Step 1 in the plan is come from  shaoxuan’s proposal in 2022
https://docs.google.com/document/d/1VZq0XGl-wCxECxJamKN9PGVPXaBhszQWV1jaIIvlFjE/edit

Thanks & Regards,
Shuqi

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [GSoC][RFC][PROPOSAL] More Sparse Index Integrations
  2023-03-28  2:12 ` Shuqi Liang
@ 2023-03-28 17:16   ` Victoria Dye
  0 siblings, 0 replies; 4+ messages in thread
From: Victoria Dye @ 2023-03-28 17:16 UTC (permalink / raw
  To: Shuqi Liang, git; +Cc: gitster, derrickstolee

Shuqi Liang wrote:
> 
> Hello everyone,
> 
> Sorry about the wrapping mistake in last patch.
> 
> Here is the initial version of my proposal on "More Sparse Index 
> Integrations". I would really love to have comments on it. If you find any 
> mistakes please notify me about that.
> 
> And website version is there: 
> https://docs.google.com/document/d/1WtoLgAJYVHY_NWNscqi358FeAY-6UBARCmpTU3BAQMs/edit#
> 
> ==================================================================

Hi Shuqi!

> 
> ###Project Timeline
> 
> 
> #Empty Period (Present - May 4)
> 
>     My end-semester exams begin on April 4 to  April 28. Hence I might be 
>     a bit busy in this period.
> 
>     After April 28, I will continue to work on 'git diff-files' and start 
>     to work on ' git describe '

An initial integration for 'describe' was recently submitted [1] so, if
you'd like to start working on another integration after 'diff-files', you
may want to pick a different command.

[1] https://lore.kernel.org/git/pull.1480.git.git.1679926829475.gitgitgadget@gmail.com/

> 
> #Community Bonding Period (May 4- May 28) 
> 
>     Get familiar with the community
> 
>     I have read the related documentation about 'Sparse Index Integrations'
>     and working on 'git diff-files' , one of the builtins that could be 
>     integrated with the sparse index. The feedback and the advice for 
>     improvement make me learn a lot. And I'm confident I can start the project 
>     at the start of this period.
> 
>     Keep working on "git diff-files' and  'git describe' on May 5, and the 
>     expected time of completion of these two is May 28.
> 
> 
> #Phase 1 (May 29 -July 9)
> 
>     week 1 to week 3  (May 29-June 18): integrate  'git write-tree' with 
>     sparse index. Use the steps above.
> 
>     week 4 to week 6(June 19 -July 9 ): integrate  'git diff-index' with 
>     sparse index,use steps above.
> 
> #Phase 2 (July 9 - August 28)
> 
>     week 1 to week 3  (July 9 - July 23): integrate  'git diff-tree' with 
>     sparse index. Use the steps above.
> 
>     week 4 to week 6 (July 23 - August 13 ): integrate 'git worktree' with 
>     sparse index. Use the steps above.
> 
>     week 7(August 13-August 28) :
>     A buffer of one week has been kept for any covert difficulties when 
>     integrated with the sparse index.

Even working 35-40 hours a week (as you mention later), this is a pretty
ambitious timeline for integrating sparse index with commands. Does this
take into account the time needed to respond to review comments and make
changes to your patches addressing those comments? 

> 
> ###Blogging about Git
> 
> When I was a freshman, I hated writing summaries or other learning materials.
> But then I started writing blogs for new knowledge to keep track of what 
> I've learned. I realized that When I  dive into a topic and want to write it 
> down, I  will think much deeper about it than just learning. I also learned a 
> lot and gained many skills from others' blogs. I would love to write about my 
> progress and experiences with the project. In this way, I could share the ideas 
> with those interested in researching this project and help them get up to speed more quickly.
> 

Where would you post your blogs?

> 
> ###Availability
> 
> My semester will complete in late April, leaving me enough time to prepare 
> for my GSoC project. Git is the only project for my summertime. If I am 
> selected, I shall be able to work five days per week, 7 - 8 hours per day, 
> around 35 -40 hrs a week on the project, though I am open to putting in more 
> effort if the work requires. If everything is going well with the plan, I may 
> want to Participate in a hackathon for a few days with my friends in July.

I don't see it mentioned anywhere in the proposal - which project size [2]
are you planning for? 35-40 hours/week for (nearly) the whole GSoC period
implies "large" project, but it would help to make your intention explicit.

Also, this helps me understand *when* you'd be available, but not *how*. I
try to keep an open line of communication with any GSoC participant I'm
mentoring to help with any issues or bugs encountered; it's also very
helpful for me to get regular (daily or every few days) updates from the
GSoC contributor on their progress.
 
To do that, what kind of communication medium would you want to use? An
instant messaging application (Slack, Discord, IRC)? Email? Something else?
In general, I'd be interested to see more on your plan for communication &
collaboration.

[2] https://developers.google.com/open-source/gsoc/faq#how_much_time_does_gsoc_participation_take

> Thanks & Regards,
> Shuqi


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [GSoC][PROPOSAL v2] More Sparse Index Integrations
  2023-03-28  1:49 [GSoC][RFC][PROPOSAL] More Sparse Index Integrations Shuqi Liang
  2023-03-28  2:12 ` Shuqi Liang
@ 2023-03-31 19:35 ` Shuqi Liang
  1 sibling, 0 replies; 4+ messages in thread
From: Shuqi Liang @ 2023-03-31 19:35 UTC (permalink / raw
  To: git; +Cc: Shuqi Liang, vdye, derrickstolee

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 13264 bytes --]


Hi Victoria,

Thanks for reviewing!

I have made some adjustments based on your feedback :)

And website version is there: 
https://docs.google.com/document/d/1WtoLgAJYVHY_NWNscqi358FeAY-6UBARCmpTU3BAQMs/edit#



## More Sparse Index Integrations
Personal Details

### About Me
| Name | Shuqi Liang |

| Mobile no. |  +1 (416) 272-1737|
| Email | cheskaqiqi@gmail.com|

| Github | https://github.com/Cheskaqiqi |
| Major | Computer Science |
| Blog | https://cheskaqiqi.github.io/ |
    

### Me & Git

I am a Computer Science undergraduate at Western University (Canada). I have 
learned C, Python, Java, and shell in the last two years. I have started 
exploring the Git codebase since Jan 2023.Here is some related documentation 
I have read:


# MyFirstContribution.txt   
# SubmittingPatches

*Way to make contributions to the git project.

# MyFirstObjectWalk.txt

*Way uses git's Object Walk API to traversethe git object database.
Topics such as object types and IDs.


# Hacking Git
# Understanding Git — Index 
# Git Internals
# UnderstandingGit  DataModel
# UnderstandingGit Branching
# Underlying data model of git and how it works.

*Types of git objects: blobs, trees, and commits.
*The concept of branching in Git and how it is used to manage code changes.
*Two layers of git: the low-level plumbing layer and the high-level 
porcelain layer.
*Relationship between the Index and the working directory.

# Make your monorepo feel small with 
# Bring your monorepo down to size with sparse-checkout 

*How the git sparse checkout and git sparse index feature can help make 
large monorepos feel smaller and more manageable

# Sparse-checkout.txt
# Sparse-index.txt

I would like to thank Shaoxuan, the 2022 GSoC who recommended these two 
articles to me. The articles cover some of the limitations and potential 
issues with ‘sparse-checkout’ and ‘sparse-index’.


###Contributions


#Status 
wip
#Subject
[Microproject]: t4113: modernize test script
#Mail List Link
https://lore.kernel.org/git/20230215023953.11880-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
[Aided a potential GSoC student]: trace.c, git.c: removed unnecessary 
parameter to trace_repo_setup
#Mail List Link
https://lore.kernel.org/git/20230215175510.3631-1-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
check-attr: integrate with sparse-index
#Mail List Link
https://lore.kernel.org/git/20230227050543.218294-2-cheskaqiqi@gmail.com/

#Status 
wip
#Subject
diff-files: integrate with sparse index
#Mail List Link
https://lore.kernel.org/git/20230322161820.3609-2-cheskaqiqi@gmail.com/




###The Project: More Sparse Index Integrations


#What's "sparse-checkout"

When the repository has so many files at root, it causes git commands to 
slow to a crawl (e.g., git checkout, git status). Now, ‘sparse-checkout’ 
allows users to restrict their working directory to only the files they care 
about. It is supposed to make users feel like they are in a small 
repository, even though they are contributing to a large one.[1]  
If users use the "microservices in a monorepo" pattern, "sparse-checkout" 
can ensure the developer workflow is as fast as possible while maintaining 
all the benefits of a monorepo.[1]



#What's ‘git index’

In git, the index, or the staging area, is an intermediate area where 
changes to a Git repository are prepared before committing to the 
repository. The index file stores a list of every file at HEAD. This list of 
files is stored as a flat list.[2]



#What's ‘sparse-index’

Although ‘sparse-checkout’  has done very well, it still has a problem: 
the Git index is still large in a monorepo.
‘sparse-index’ allows the index to focus on the files within the 
sparse-checkout cone. The size of the sparse index will scale with the number 
of files within users' chosen directories instead of the full size of the 
repository. When enabled with a number of other performance features, this 
can have a dramatic performance improvement.[2]



#Problem when integrated with sparse index

The idea of 'sparse index' is easy to understand, but pruning the index at 
the directory level may cause a complicated result. That is because, In the 
Git codebase, numerous places directly interact with the index in various 
nuanced ways. All of them assume that each index entry refers to a blob 
object. But sparse-directory entries violate expectations about the index 
format and its in-memory data structure. Many consumers in the codebase 
expect to iterate through all of the index entries and see only files. 
Compatibility layers are made to expand a sparse index to an equivalent full
 index. So that even if the on-disk format is sparse, code paths that have 
 not been integrated and tested with a sparse index can still be used.[2]

The Git Fundamentals team first started by creating the ensure_full_index() 
method, which converts a sparse index into a full one. 

But this method takes longer to expand a sparse index to a full one than to 
read a full one from a disk. It goes against our idea of utilizing 
'sparse index' to enhance user experience. To gain the performance benefits 
of a sparse index and improve user experience, we need to optimize the 
compatibility and teach git to only expand the sparse directory entries 
only when needed. 

###Plan

Every integration will have similar steps, but the actual steps of commands
 integrated for the project will vary based on the complexity of the commands 
 chosen:

(Notice that step 1 is from ShaoXuan's GSoC 2022 Git Contributor Proposal, 
and steps 2,4-7 are from GSoC 2023 Ideas. I made some additions to the steps.)

1. Investigation around a Git command and see if it behaves correctly with 
sparse-checkout. Modify the Git command's logic to work better with 
'sparse-checkout'. Add corresponding tests.[3]

    *Step 1 does not often occur. But it is important to ensure 
    *"sparse-checkout" is compatible with the Git command to continue 
    *with  the next step.

2. Add tests to t1092-sparse-checkout-compatibility.sh for the built-in, 
with a focus on what happens for paths outside of the sparse-checkout cone.


    t1092-sparse-checkout-compatibility.sh create a repository with some 
    data shapes in it. Each test case starts by copying that repository 
    into three new repositories with different configurations. These three 
    are called 'full-checkout'  , 'Sparse-checkout' , and 'Sparse-index', 
    respectively :

    'full-checkout' is the same as the repository, without sparse-checkout.

    'Sparse-checkout' with cone mode sparse-checkout enabled.

    'Sparse-index'  with cone mode sparse-checkout and sparse index enabled. 

    Add tests for the git command we want to integrate with sparse-index to 
    t1092 without the code change. Focus on the git command, which has the 
    ability to affect full-checkout/sparse-checkout/sparse-index differently.
    The test case should run against all three repositories and have the same 
    output and it should also work when the index is expanded. After 
    integrating the git command with sparse-index, the output and behavior 
    should remain the same.  


3. Add performance tests, so we have a baseline to measure how
   well the git command does.

    We need a baseline to measure the speed before integrating the git command 
    with sparse-index. Normally we will notice the speed is quite slow caused
     by expanding the index.



4. Disable the command_requires_full_index setting in the built-in and 
ensure the tests pass.

5. If the tests do not pass, then alter the logic to work with the 
sparse index.

    Make the code change to only expand the sparse directory entries only 
    when needed.command_requires_full_index setting guards all index reads 
    to require a full index over a sparse index.

    After suitable guards are placed in the codebase around uses of the 
    index, remove the setting. 

6. Add tests to check that a sparse index stays sparse.

    Add  ensure_not_expanded test to t1092-sparse-checkout-compatibility.sh,
    We expect the index to be expanded for out-of-cone moves. But we need to 
    ensure the index will not expand for in-cone moves.

7. Run performance tests to demonstrate speedup.


###Project Timeline


#Empty Period (Present - May 4)

    My end-semester exams begin on April 4 to  April 28. Hence I might be 
    a bit busy in this period.But I will continue to research 
    'git diff-files' in my spare time.

    After April 28, I will continue to work on 'git diff-files' and start 
    to work on ' git write-tree '

#Community Bonding Period (May 4- May 28) 

    Get familiar with the community

    I have read the related documentation about 'Sparse Index Integrations'
    and working on 'git diff-files' , one of the builtins that could be 
    integrated with the sparse index. The feedback and the advice for 
    improvement make me learn a lot. And I'm confident I can start the 
    project at the start of this period.

    Keep working on "git diff-files' and  'git write-tree ' on May 5, and 
    the expected time of completion of these two is May 28.


#Phase 1 (May 29 -June 30)

    integrate  'git diff-index' with sparse index,use steps above.

#Phase 2 (July 1 - July 30)

    integrate  'git diff-tree' with sparse index. Use the steps above.

#Phase 3 (July 31 - August 28)

    integrate 'git worktree' with sparse index. Use the steps above.


The list of builtins organized in order of least-difficult to most-difficult 
in  GSoC 2023 Ideas may not accurately reflect the complexity of the tasks 
involved. During the project, unexpected difficulties may arise. For example: 
The work done by Shaoxuan on 'git mv' has been found to function improperly 
with 'git-sparse-checkout' during the project period, and it took nearly 
three months to resolve this issue.

Taking into account these potential challenges, I may need to make 
appropriate adjustments to the proposed integration schedule when digging 
into each command.



###Blogging about Git

When I was a freshman, I hated writing summaries or other learning materials.
But then I started writing blogs for new knowledge to keep track of what I've
learned. I realized that When I  dive into a topic and want to write it down, 
I  will think much deeper about it than just learning. I also learned a lot 
and gained many skills from others' blogs. I would love to write about my 
progress and experiences with the project. In this way, I could share the 
ideas with those interested in researching this project and help them get up 
to speed more quickly.



###Availability

My semester will complete in late April, leaving me enough time to prepare 
for my GSoC project. Git is the only project for my summertime. If I am 
selected, I shall be able to work five days per week, 7 - 8 hours per day, 
dedicating around 35 -40 hrs a week on the project.  While I may participate 
in a hackathon for a few days with my friends in July, I will ensure that my 
project progress is not affected. Throughout the program, I will stay active 
on the Git mailing list and  respond to all communication daily.

The difficulty level of this project is medium, and the expected project size
is estimated to require between 175 to 350 hours of work. Based on my 
proposed commitment of 35-40 hours per week, this project aligns with my 
availability and intended workload for the GSoC period, ensuring that I have 
sufficient time to accommodate any unforeseen circumstances that may arise 
during the project.



###Post GSoC


I have received much support from many members of the Git community in recent 
months. This support has strengthened my passion for git and inspired me to 
contribute more of my code to the community. Just as others have helped me, 
I will pay it forward by assisting and encouraging new community members. I 
believe that sharing knowledge and collaborating with others is the key to 
creating great software and achieving success in the open-source world.

I am committed to delivering high-quality work and meeting the expectations 
of the community. I am eager to learn from experienced community members and 
gain new skills and knowledge that will help me become a more valuable 
contributor. I am eager to continue working with the Git community beyond the
scope of the GSOC program. I believe I have much to offer the community, 
and I am committed to contributing to its success for a long time.

I hope that the Git community will consider my application for the GSOC 
program. It would be an honor to be able to contribute to such a fantastic 
open-source project and work with such a supportive and welcoming community. 
Thank you for your time and consideration.



###References

[1] Bring your monorepo down to size with sparse-checkout.
 https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/

[2] Make your monorepo feel small with Git’s sparse index. 
https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/


[3]Step 1 in the plan is come from  shaoxuan’s proposal in 2022
https://docs.google.com/document/d/1VZq0XGl-wCxECxJamKN9PGVPXaBhszQWV1jaIIvlFjE/edit

Thanks & Regards,
Shuqi



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-03-31 19:36 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-28  1:49 [GSoC][RFC][PROPOSAL] More Sparse Index Integrations Shuqi Liang
2023-03-28  2:12 ` Shuqi Liang
2023-03-28 17:16   ` Victoria Dye
2023-03-31 19:35 ` [GSoC][PROPOSAL v2] " Shuqi Liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).