preface

After a couple of changes and two weeks of communication with the Apache Spark open source community, my Pull Request has been accepted!!

If you are interested, you can visit the links below to feel the charm of the open source community. The leaders were really enthusiastic and gave me a lot of valuable advice on my PR!

[SPARK-35907][CORE] Instead of File#mkdirs, Files#createDirectories is expected. by Shockang · Pull Request #33101 · apache/spark

The main text

The following is a summary of my PR submission and acceptance process. If you are interested in the open source community and want to become an Apache Spark contributor, reading this article will save you a lot of detours.

Basic requirements for being an Apache Spark Contributor:

  1. Solid Java and Scala programming skills. (If you don’t understand Scala, especially some of its advanced syntax, you basically can’t read the Apache Spark source code. Scala has a steep learning curve, and it’s impossible to learn Scala without knowing Java. Part of the Apache Spark source code is also written in Java.)
  2. Decent English. (Mine is not great either, but I can at least tell whether Google’s translation is good ~( ̄▽ ̄)(~ ̄▽ ̄)~ This matters a lot, because you have to write many comments yourself, and at the very least you must make sure the comments are correct, so native speakers won’t laugh at you.)
  3. Familiarity with the basic concepts of Apache Spark. (For example: what is the Worker mentioned below, and what does it do?)
  4. A GitHub account, a JIRA account, and a stable network connection.

Here are the programming language components of Apache Spark:

fork

Fork Apache Spark into your own GitHub repository and clone it locally.

GitHub Actions

The GitHub Actions workflow needs to be enabled for testing.

GitHub will run automated tests on your code after every push, so this must also be enabled.

Apache Spark utilizes GitHub Actions for continuous integration and extensive automation.

Here’s a counterexample:

The first time I submitted the PR, I had not enabled GitHub Actions, and the result was an error:

Workflow run detection failed

Unable to detect the workflow run for testing the changes in your PR.

The code

The first and most important thing is to find places where you can change the code.

How did I find where to modify the Apache Spark code?

I had been reading the Apache Spark source code for a while. While reading the Worker startup code (which belongs to Spark Core), I came across this method:

  /**
   * Create a directory given the abstract pathname
   * @return true, if the directory is successfully created; otherwise, return false.
   */
  def createDirectory(dir: File): Boolean = {
    try {
      // This sporadically fails - not sure why ... !dir.exists() && !dir.mkdirs()
      // So attempting to create and then check if directory was created or not.
      dir.mkdirs()
      if (!dir.exists() || !dir.isDirectory) {
        logError(s"Failed to create directory " + dir)
      }
      dir.isDirectory
    } catch {
      case e: Exception =>
        logError(s"Failed to create directory " + dir, e)
        false
    }
  }

The code is excerpted from the org.apache.spark.util.Utils class.

When the Worker starts, it calls the above utility class method to create the working directory.

I noticed this sentence in the comments:

This sporadically fails - not sure why ...

I don’t think Apache Spark, as a top open source project, should have such loose wording in its source code.

The File#mkdirs() method should be familiar to any Java programmer. It creates a directory, including any nonexistent parent directories.

This method may fail to create the directory and return false on failure (even though some parent directories may already have been created). The reasons for failure include:

  1. Permission issues (this is the most common)
  2. An IO error (OS dependent)
  3. A file already exists at the target path
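The third failure mode is easy to reproduce. Here is a minimal, hypothetical demo (not from the Spark codebase) showing that File#mkdirs() fails silently when a regular file already occupies the target path:

```java
import java.io.File;
import java.nio.file.Files;

// Hypothetical demo, not Spark code: File#mkdirs() reports failure
// only through its boolean return value, with no reason attached.
public class MkdirsDemo {
    public static void main(String[] args) throws Exception {
        File base = Files.createTempDirectory("mkdirs-demo").toFile();
        // Put a regular file where we will try to create a directory.
        new File(base, "work").createNewFile();
        boolean created = new File(base, "work").mkdirs();
        // mkdirs() just returns false -- it cannot tell us why it failed.
        System.out.println("mkdirs returned: " + created);
    }
}
```

Running this prints `mkdirs returned: false`, and that boolean is all the information the caller gets.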

My original intention when I first submitted the PR was just to improve the comments (I followed the path of comments -> code optimization -> bug fixes -> feature development, which is also the growth path of most new contributors).

I was about to add the failure causes above to the source code in English when I remembered that JDK 7 introduced a new utility class, java.nio.file.Files, which has a method Files.createDirectories().

I had studied it once before, back when I was writing my own IO utility classes ⊙–⊙

This method can completely replace File#mkdirs().

I then looked into the Files.createDirectories() source code and found that this method handles exceptions much better than File#mkdirs(). When File#mkdirs() fails to create a directory, it only returns false; the only exception it ever throws is a SecurityException. Files.createDirectories() has a friendlier exception model: it throws a SecurityException for permission issues, an IOException for OS-level IO errors, and a FileAlreadyExistsException (which is also a subclass of IOException) when a file already exists at the path.

In other words, with Files.createDirectories() the "not sure why" problem of File#mkdirs() simply cannot exist: the exception it throws explicitly indicates the cause of the failure.
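By contrast with the mkdirs() behavior, here is a hypothetical sketch (again not Spark code) of Files.createDirectories() in the same situation, where the cause surfaces as a typed exception:

```java
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo, not Spark code: Files.createDirectories() throws
// a typed exception that names the cause of the failure.
public class CreateDirectoriesDemo {
    public static void main(String[] args) throws Exception {
        Path base = Files.createTempDirectory("nio-demo");
        // A regular file sits where the directory should go.
        Files.createFile(base.resolve("work"));
        try {
            Files.createDirectories(base.resolve("work"));
        } catch (FileAlreadyExistsException e) {
            // The exception type itself explains the failure.
            System.out.println("failed because: " + e.getClass().getSimpleName());
        }
    }
}
```

Instead of a bare false, the caller is told exactly which of the three failure causes occurred.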

I realized that this was a good change, so I searched the Apache Spark source code and found many calls to File#mkdirs() in it.

So I decided to submit a PR, as shown below:

  /**
   * Create a directory given the abstract pathname
   * @return true, if the directory is successfully created; otherwise, return false.
   */
  def createDirectory(dir: File): Boolean = {
    try {
      // SPARK-35907: The check was required by File.mkdirs() because it could sporadically
      // fail silently. After switching to Files.createDirectories(), ideally, there should
      // no longer be silent fails. But the check is kept for the safety concern. We can
      // remove the check when we're sure that Files.createDirectories() would never fail silently.
      Files.createDirectories(dir.toPath)
      if (!dir.exists() || !dir.isDirectory) {
        logError(s"Failed to create directory " + dir)
      }
      dir.isDirectory
    } catch {
      case e: Exception =>
        logError(s"Failed to create directory " + dir, e)
        false
    }
  }

This is the final version after discussion with the community members; only part of the code is shown here, to make it easier to follow alongside the explanation above.

push

Keep an eye on GitHub Actions after each push to see whether the test cases run successfully. (I have seen test cases fail flakily several times; when that happens, just re-run them.)

JIRA

Be sure to file an issue in Apache Spark's JIRA before submitting the PR on GitHub.

The JIRA URL of Apache Spark

Please refer to my issue for submission format:

SPARK-35907 Instead of File#mkdirs, Files#createDirectories is expected

Submit a PR

Note: create and switch to a dedicated branch for the code you are submitting, preferably named after the JIRA issue number above.

PR Format requirements

After you open the PR, the following template is automatically displayed in the input box. Here is an explanation of each section:

### What changes were proposed in this pull request?

What changes does this PR propose? Just show where the code has changed.

### Why are the changes needed?

Why are these changes needed? You should come up with a good reason to convince the committers to merge your code into master.

### Does this PR introduce _any_ user-facing change?

Does this PR introduce any user-facing change? Note that you must write "Yes" if there is any user-facing change, and spell out exactly which changes are user-facing.

### How was this patch tested?

How was the patch tested? This is very important: a first PR must come with solid test cases to be accepted.

The following is the PR I submitted, which can be read alongside the explanation above:

[SPARK-35907][CORE] Instead of File#mkdirs, Files#createDirectories is expected

### What changes were proposed in this pull request?

The code of method: createDirectory in class: org.apache.spark.util.Utils is modified.

### Why are the changes needed?

To solve the problem of ambiguous exception handling in traditional IO creating directories. What's more, there shouldn't be an improper comment in Spark's source code.

### Does this PR introduce _any_ user-facing change?

Yes

The modified method would be called to create the working directory when Worker starts.

The modified method would be called to create local directories for storing block data when the class: DiskBlockManager instantiates.

The modified method would be called to create a temporary directory inside the given parent directory in several classes.

### How was this patch tested?

I have provided test cases as much as possible.

communication

Importance: MAX

Once the code is written, you need to convince a Committer to merge your code into master. Otherwise, no amount of code matters!

Be sure to talk with the community members; several contributors may have suggestions about your submission, and you should reply to every one of them. Committer suggestions have top priority, because only they can merge your code into master.

This is where good English really matters. If yours is good, you won’t have to communicate through Google Translate like I did, translating every reply back and forth.

┭┮﹏┭┮

Successfully merged into master

The most beautiful words o( ̄▽ ̄)o

conclusion

Code is important, testing is important, and communication is important.