Here I record some instructive bugs I have encountered in enterprise development, along with the troubleshooting that went into fixing them.

SQL DML Blocking Issue (Oracle)

Bug Description

Recently, our project needed to receive data from the Big Data department's interface and perform some update operations on the data before inserting it into a table, which I will call the "Business" table below. During this operation, we unexpectedly observed an execution time exceeding twenty minutes.

At first I thought it was simply a matter of large data volume, but that turned out not to be the case. So I had to put on a DBA hat and track down the root cause of the blocking. Here I document the process of resolving this problem.

Troubleshooting

First, we can use the following SQL query to confirm whether the statement is being blocked:

select s1.username || '@' || s1.machine || ' (SID=' || s1.sid || ')' as blocking_session,
       s2.username || '@' || s2.machine || ' (SID=' || s2.sid || ')' as waiting_session,
       s1.sql_id, s1.sql_child_number, s1.sql_exec_start, s2.event
from v$session s1, v$session s2
where s2.blocking_session = s1.sid;

This SQL query is used to detect long-running blocking situations in an Oracle database. It retrieves information about sessions from the v$session view, particularly the sessions that are blocking others and the sessions that are waiting.

  • s1 and s2 are aliases for the v$session view, here playing the roles of the blocking session and the waiting session respectively.
  • s1.username || '@' || s1.machine || ' (SID=' || s1.sid || ')' (and the same expression for s2) builds a readable label for each session from its username, machine name, and session identifier (SID).
  • s1.sql_id and s1.sql_child_number identify the SQL statement (and child cursor) the blocking session is executing.
  • s1.sql_exec_start is the start time of that statement's execution.
  • s2.event is the wait event the waiting session (s2) is currently stuck on.
  • where s2.blocking_session = s1.sid is the query's filter condition: it joins each waiting session (s2) to the session that is blocking it (s1).

With the above SQL, we can identify the SID of the blocking session. Then, we can use the following SQL to confirm what that session is doing:

select s.sid, s.serial#, s.username, s.event, s.sql_id, s.sql_child_number, s.sql_exec_start, q.sql_text
from v$session s
join v$sql q on s.sql_id = q.sql_id and s.sql_child_number = q.child_number
where s.sid = <blocking_session_id>;

(Here, <blocking_session_id> is the blocking SID found in the previous query.)

This SQL query is used to find detailed information about a specified session (sid) in the database, including the session’s username, the SQL query it is currently executing, the wait event, and other relevant information.

  • v$session s and v$sql q are aliases for the v$session and v$sql views, representing session information and SQL query information, respectively.
  • q.sql_text is used to select the SQL query text for the session.
  • s.sql_id = q.sql_id and s.sql_child_number = q.child_number indicate that the session’s SQL query ID and child number match the SQL ID and child number in the SQL query information to associate the session with the query.
  • where s.sid = <blocking_session_id> is the query’s filter condition, where you should input the previously found blocking SID.

From the query results, I checked s.event and the SQL text, and found that the blocking was caused by an update statement.

Now, I have two options:

(1) Terminate the blocking session with the alter system kill session '<sid>,<serial#>' immediate command, which rolls back its transaction.

(2) Continue investigating to find the root cause of the problem and make direct modifications to the problematic code.

Clearly, the second option is the right course of action: killing the session would likely just hit the same error on the next run. So let's proceed with the second option and query v$lock to examine the lock information.

select l.sid, s.username, l.type, l.id1, l.id2, l.lmode, l.request
from v$lock l
join v$session s on l.sid = s.sid;

This SQL query is used to examine lock information in an Oracle database, helping to identify which sessions are either holding or requesting locks and providing related session information.

  • v$lock l and v$session s are aliases for the v$lock view and the v$session view, representing lock information and session information, respectively.
  • l.sid = s.sid associates lock information’s session ID (sid) with session information’s session ID to relate locks to sessions.
  • l.sid: The session ID of the session holding or requesting the lock. Each session has a unique SID.
  • s.username: The username of the session holding or requesting the lock. It identifies the user performing operations in the database.
  • l.type: The lock type. Different lock types protect different resources, such as tables, transactions, etc.
  • l.id1 and l.id2: These two columns together identify the locked resource. Their meaning depends on the lock type; for a table (TM) lock, id1 is the object ID of the table.
  • l.lmode: The mode in which the lock is held (e.g. row-exclusive, share, exclusive); 0 means the lock is not held.
  • l.request: The mode in which the lock is being requested; 0 means the session is not requesting a lock.
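The LMODE and REQUEST values are numeric codes. As a quick reference, the standard Oracle lock-mode numbering can be sketched as a lookup table (this helper class is my own illustration, not part of any Oracle API):

```java
import java.util.Map;

// Illustrative helper: maps Oracle v$lock LMODE / REQUEST codes to names,
// following the numbering documented in the Oracle Database Reference.
public class LockMode {
    private static final Map<Integer, String> MODES = Map.of(
            0, "none",
            1, "null (NULL)",
            2, "row share (SS)",
            3, "row exclusive (SX)",
            4, "share (S)",
            5, "share row exclusive (SSX)",
            6, "exclusive (X)");

    public static String name(int mode) {
        return MODES.getOrDefault(mode, "unknown");
    }

    public static void main(String[] args) {
        System.out.println("lmode 3 = " + name(3)); // row exclusive (SX)
        System.out.println("lmode 6 = " + name(6)); // exclusive (X)
    }
}
```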

Matching the query results against the SIDs found earlier, the blocking session holds the following locks:

TYPE  ID1     ID2     LMODE  REQUEST
AE    100     0       4      0
TM    168385  0       3      0
TX    393225  221646  0      6

The above results indicate:

  1. The first row (AE lock):

    • type is "AE", an edition lock. Every session in recent Oracle versions holds one, so it is unrelated to our blocking.
    • ID1 is 100, the identifier of the edition in use.
    • ID2 is 0.
    • LMODE is 4, meaning the lock is held in share mode.
    • REQUEST is 0, meaning no new lock is being requested.
  2. The second row (TM lock):

    • type is "TM", a table-level (DML) lock.
    • ID1 is 168385, the object ID of the locked table.
    • ID2 is 0, as is usual for table-level locks.
    • LMODE is 3, meaning the table lock is held in row-exclusive (SX) mode, the mode an update takes.
    • REQUEST is 0, meaning no new lock is being requested.
  3. The third row (TX lock):

    • type is "TX", a transaction lock.
    • ID1 is 393225, which packs the undo segment number and transaction slot of the transaction being waited on.
    • ID2 is 221646, the sequence (wrap) number of that transaction.
    • LMODE is 0, meaning our session holds nothing here.
    • REQUEST is 6, meaning it is requesting the lock in exclusive mode, the classic signature of waiting on a row lock.
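For TX locks, ID1 is not an arbitrary number: Oracle packs the undo (rollback) segment number into its upper 16 bits and the transaction slot into the lower 16, while ID2 carries the wrap/sequence number. A small sketch of that decoding (my own helper, not Oracle code), applied to the ID1 above:

```java
// Illustrative decode of a TX lock's ID1 column:
// upper 16 bits = undo segment number (XIDUSN),
// lower 16 bits = transaction slot (XIDSLOT).
public class TxLockId {
    public static int undoSegment(long id1) {
        return (int) (id1 >> 16);
    }

    public static int slot(long id1) {
        return (int) (id1 & 0xFFFF);
    }

    public static void main(String[] args) {
        long id1 = 393225L; // the ID1 from the v$lock output above
        System.out.println("undo segment = " + undoSegment(id1)); // 6
        System.out.println("slot         = " + slot(id1));        // 9
    }
}
```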

Next, let's identify which row the third line's transaction lock is waiting for:

select do.object_name, s.row_wait_obj#, do.data_object_id,
       s.row_wait_file#, s.row_wait_block#, s.row_wait_row#,
       dbms_rowid.rowid_create(1, do.data_object_id, s.row_wait_file#, s.row_wait_block#, s.row_wait_row#)
from v$session s, dba_objects do
where s.sid = <sessionId>
and s.row_wait_obj# = do.object_id;

This SQL query finds information about the row-level lock a specific session (sid) is waiting for. Here are the meanings of its parts:

  1. do.object_name: This represents the name of the data object for which the session is waiting for row-level locks, such as the table’s name or the name of another database object.
  2. row_wait_obj#: This is the object number for the row-level lock the session is waiting for. It is used to identify the data object for which row-level locks are being waited.
  3. do.data_object_id: This is the unique identifier of the data object.
  4. row_wait_file#, row_wait_block#, row_wait_row#: These indicate specific location information for the row-level lock, including the file number, block number, and row number being waited for.
  5. dbms_rowid.rowid_create(1, do.data_object_id, row_wait_file#, row_wait_block#, row_wait_row#): This function call builds the ROWID (Row Identifier) of the row being waited for, so we can locate it precisely.
  6. v$session s, dba_objects do: This part indicates that the query will retrieve information from the v$session view and the dba_objects view.
  7. where sid = <sessionId>: This is the query’s filtering condition, specifying the sid of the specific session to be examined.
  8. and s.row_wait_obj# = do.object_id: This is the query’s join condition, matching the session’s row_wait_obj# with the data object’s object_id to retrieve information about the data object for which row-level locks are being waited.

Next, by inserting the obtained ROWID (AABK/5AAFAAAAHHAAA) into the following SQL query, we can identify the specific row that is causing the issue:

select * from Business where rowid = 'AABK/5AAFAAAAHHAAA';

Please note that 'Business' here is the stand-in name for the business table introduced at the beginning of this post, not a real table name.
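As an aside, an extended ROWID such as 'AABK/5AAFAAAAHHAAA' is itself structured: it is base-64 text in the layout OOOOOOFFFBBBBBBRRR (six characters of data object number, three of relative file number, six of block number, three of row number), over the alphabet A-Z, a-z, 0-9, +, /. A small sketch of that decoding (my own helper, not an Oracle API):

```java
// Illustrative decoder for Oracle's extended ROWID layout:
// chars 0-5 = data object number, 6-8 = relative file number,
// 9-14 = block number, 15-17 = row number (base-64, alphabet A-Za-z0-9+/).
public class RowidDecode {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    static long decode(String part) {
        long value = 0;
        for (char c : part.toCharArray()) {
            value = value * 64 + ALPHABET.indexOf(c);
        }
        return value;
    }

    public static void main(String[] args) {
        String rowid = "AABK/5AAFAAAAHHAAA";
        System.out.println("object# = " + decode(rowid.substring(0, 6)));  // 307193
        System.out.println("file#   = " + decode(rowid.substring(6, 9)));  // 5
        System.out.println("block#  = " + decode(rowid.substring(9, 15))); // 455
        System.out.println("row#    = " + decode(rowid.substring(15, 18))); // 0
    }
}
```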

Solution

After pinpointing the problematic rows, I found that the real culprit was duplicate data: the insert following the update executed so quickly that two records ended up contending for the same row lock at the same time, producing the blocking. I therefore modified the original Java code to add a 50 ms delay after each update statement, which resolved the problem.
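The shape of that fix, reduced to a sketch (the Runnable list stands in for the real per-row update logic; all names here are illustrative, not the actual project code):

```java
import java.util.List;

// Simplified sketch of the fix: pace each update so two rows no longer
// contend for the same row lock back to back. Names are illustrative.
public class PacedUpdater {
    public static void runAll(List<Runnable> updates, long delayMillis)
            throws InterruptedException {
        for (Runnable update : updates) {
            update.run();              // the real code issues one UPDATE here
            Thread.sleep(delayMillis); // 50 ms breathing room between rows
        }
    }

    public static void main(String[] args) throws InterruptedException {
        runAll(List.of(
                () -> System.out.println("update row 1"),
                () -> System.out.println("update row 2")), 50);
    }
}
```

A fixed sleep is a pragmatic stopgap rather than a principled fix; deduplicating the incoming data or batching the updates in one transaction would address the root cause more directly.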


Wrapper Object Reference Issue

Bug Description

The issue of == and equals with wrapper objects is a well-discussed topic, but I recently encountered a pitfall.

While completing a recent project, I faced a problem: within a tree-shaped dropdown menu, some menus had their second- and third-level submenus, while others, which were supposed to have third-level submenus, were missing them.

It looked something like this:

Troubleshooting

So, I delved into investigating how this bug occurred.

Starting with debugging the frontend code, I discovered that the issue originated from the el-tree component of the Element-UI library. The logic appeared straightforward, and there shouldn’t have been any issues:

<el-tree :data="menus"
         :default-expanded-keys="expandedKeys"
         :expand-on-click-node="false"
         :props="defaultProps"
         node-key="catId"
         draggable
         :allow-drop="allowDrop"
         show-checkbox>

getMenus () {
  this.$http({
    url: this.$http.adornUrl('/product/category/list/tree'),
    method: 'get'
  }).then(({data}) => {
    this.menus = data.page
  })
},

Therefore, I shifted my focus to the backend code. The serviceImpl of the backend’s list/tree method is a recursive tree lookup method:

@Override
public List<CategoryEntity> listWithTree() {
    // 1. find all categories
    List<CategoryEntity> entities = baseMapper.selectList(null);

    // 2. assemble the tree (parent-child)
    // 2.1 find all first-level categories
    List<CategoryEntity> firstLevel = entities.stream().filter((item) ->
        item.getParentCid() == 0
    ).map((cat) -> {
        cat.setChildren(getChildren(cat, entities));
        return cat;
    }).sorted(Comparator.comparingInt(CategoryEntity::getSort)).collect(Collectors.toList());
    return firstLevel;
}

/**
 * recursively collect each category's child categories
 */
private List<CategoryEntity> getChildren(CategoryEntity root, List<CategoryEntity> all) {
    List<CategoryEntity> children = all.stream().filter(categoryEntity -> {
        return categoryEntity.getParentCid() == root.getCatId();
    }).map(category -> {
        // find this category's children
        category.setChildren(getChildren(category, all));
        return category;
    }).sorted(Comparator.comparingInt(cat -> (cat.getSort() == null ? 0 : cat.getSort()))).collect(Collectors.toList());
    return children;
}

The baseMapper.selectList(null) is used to fetch all the data.

The CategoryEntity has several core attributes: catId is the current category’s ID, parentCid is the ID of the parent category, and the parent category’s ID for the top-level category points to 0. children is an array used to store the subcategories of the current category, each represented by a CategoryEntity.

In the listWithTree method, item.getParentCid() == 0 is used to identify all top-level categories. Then, it sets the children for each of these categories. But how does it retrieve the subcategories for a particular category?

Next, the getChildren method is called. It takes the current category and all the data, and uses categoryEntity.getParentCid() == root.getCatId() to find all entries whose parent category points to the current category. It then recurses until a category has no children left.

Initially, when studying the code logic, I didn't find any issues. However, during breakpoint debugging, I noticed that the "E-book" category had its children populated, so why did the "sas" category have an empty children array?

I then checked the database and confirmed that there are indeed child data entries pointing to the “sas” category. So, the question arose: why can some categories find their children while others cannot?

The breakpoint debugging information from the stream pipeline was somewhat limited, so I resorted to using log.info to inspect each step. After patiently investigating, I finally discovered the root of the problem:

The filter condition categoryEntity.getParentCid() == root.getCatId() was producing an empty result, which was strange: it matched no objects even though I had already verified in the database that matching rows exist.

This led me to ponder a crucial question: what do getParentCid() and getCatId() actually return?

@TableId
private Long catId;
private Long parentCid;

So the backend code uses the wrapper type Long rather than the primitive long. That realization led me to the heart of the issue:

For Long values between -128 and 127, Java serves objects from a cache, so identical values share the same object. Beyond this range, a new object is created each time. Comparing wrapper objects with == compares references, so outside the cache range two Longs with the same value can still compare as false.

This explains why some categories were displayed correctly while others were not: the ones displayed correctly had ID values that happened to fall inside the cache range (-128 to 127).
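The pitfall is easy to reproduce in isolation (the values and variable names below are my own illustration):

```java
// Autoboxing goes through Long.valueOf, which caches values -128..127.
// Inside the cache range, == happens to compare the *same* object;
// outside it, == compares two distinct objects and returns false
// even though the values are equal.
public class LongCachePitfall {
    public static void main(String[] args) {
        Long smallA = 127L, smallB = 127L; // both resolve to the cached object
        Long bigA = 128L, bigB = 128L;     // two fresh objects on stock JVMs

        System.out.println(smallA == smallB);  // true
        System.out.println(bigA == bigB);      // false on a standard JVM
        System.out.println(bigA.equals(bigB)); // true - compares values, not references
    }
}
```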

Solution

  • Understanding the problem, the solution is straightforward: replace == with equals in the comparison:

private List<CategoryEntity> getChildren(CategoryEntity root, List<CategoryEntity> all) {
    List<CategoryEntity> children = all.stream().filter(categoryEntity -> {
        return categoryEntity.getParentCid().equals(root.getCatId());
    }).map(category -> {
        // find this category's children
        category.setChildren(getChildren(category, all));
        return category;
    }).sorted(Comparator.comparingInt(cat -> (cat.getSort() == null ? 0 : cat.getSort()))).collect(Collectors.toList());
    return children;
}

EasyPoi Excel picture parsing null pointer

Bug Description

A very interesting bug that troubled me for an entire day.

There were two almost identical Excel files. When imported and parsed, one of them threw a null pointer exception, while the other was parsed and imported normally.

The business workflow here is:

  • The user first downloads the Excel template
  • The user fills in the information to import according to the template; one column holds image information
  • The images are pasted into Excel by the user
  • Finally, the user clicks Import; the backend parses the Excel file and writes the information into the database

The users hit this bug during that process. They had two almost identical Excel files whose text content was exactly the same; the only difference was the pictures.

Stranger still, even after pasting a picture from the file that imported normally into the problematic file, the null pointer exception persisted.

Troubleshooting

At first I thought the picture might be covering other cells, so I tried resizing it, but that didn't help.

Then I studied the backend code carefully and found that this part is parsed by calling the importExcelMore method of EasyPoi's ExcelImportUtil class. Drilling further down, the problem occurs in the getSheetPictures07 method of EasyPoi's PoiPublicUtil class.

More precisely, the error is thrown when executing the code shown below at this point:

...
pic.getPreferredSize()
...

Breakpoint debugging shows that this call needs the anchor's cell1 and cell2 attributes.

Here the cell2 attribute is null, hence the null pointer.

Knowing where the error comes from, the next question is: why is cell2 null, and where is this attribute constructed?

After a long search, I finally found that it is constructed in the getAnchorFromParent method of the POI library's XSSFDrawing class.

The method looks like this:

private XSSFAnchor getAnchorFromParent(XmlObject obj) {
    XSSFAnchor anchor = null;
    XmlObject parentXbean = null;
    XmlCursor cursor = obj.newCursor();
    if (cursor.toParent()) {
        parentXbean = cursor.getObject();
    }
    cursor.dispose();
    if (parentXbean != null) {
        if (parentXbean instanceof CTTwoCellAnchor) {
            CTTwoCellAnchor ct = (CTTwoCellAnchor) parentXbean;
            anchor = new XSSFClientAnchor(ct.getFrom(), ct.getTo());
        } else if (parentXbean instanceof CTOneCellAnchor) {
            CTOneCellAnchor ct = (CTOneCellAnchor) parentXbean;
            anchor = new XSSFClientAnchor(getSheet(), ct.getFrom(), ct.getExt());
        } else if (parentXbean instanceof CTAbsoluteAnchor) {
            CTAbsoluteAnchor ct = (CTAbsoluteAnchor) parentXbean;
            anchor = new XSSFClientAnchor(getSheet(), ct.getPos(), ct.getExt());
        }
    }
    return anchor;
}

We don't need to understand every line; the key is the middle section, where the type of parentXbean determines which anchor is created, and with it the cell2 property.

Breakpoint debugging shows that a problematic Excel file takes the parentXbean instanceof CTOneCellAnchor branch.

A normal Excel file, by contrast, takes the parentXbean instanceof CTTwoCellAnchor branch.

Then, the problem lies in CTTwoCellAnchor and CTOneCellAnchor.

What are these two classes? They describe how a picture is anchored to cells. Thanks to Greg Woolsey for explaining it right here; refer to:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61203

Therefore, CTTwoCellAnchor corresponds to "Move and size with cells", while CTOneCellAnchor corresponds to "Move but don't size with cells".

This maps directly to the picture placement properties in Excel:


Solution

  • At this point the problem is solved: you only need to change the picture's placement property in Excel to "Move and size with cells". As for the deeper cause, the project schedule was tight and I had no time to dig further; readers who understand it are welcome to leave a message in the comment area.

Json default parsing method

Bug Description

A very ordinary operation: I call a backend method that returns a List result set to the frontend, where it arrives as a list of JSON objects.

The data I’m going to return has these columns:

  • DESCR - Department name
  • userId - ID
  • name - name
  • jobType - job type

Among them, userId, name, and jobType display normally on the frontend, but the DESCR column shows no data:

DESCR   userId   name   jobType
        1        a      A
        2        b      B
        3        c      C

Checking the backend with a breakpoint, I found the data was passed correctly: the DESCR column had a value, yet it was not displayed on the frontend.

Troubleshooting

Setting a breakpoint on the frontend, I found the JSON data actually looked like this:

{dESCR:...,userId:'1',name:'a',jobType:'A'},
{dESCR:...,userId:'2',name:'b',jobType:'B'},
...

This is strange, why is the first letter of the DESCR column lowercase?

It was later discovered that this is not limited to DESCR: the first letter of every column name that starts with a capital letter gets lowercased. The other columns already started with a lowercase letter, so nothing visibly changed for them.

To understand why, we need the principle behind it: all JSON handling in Spring goes through HttpMessageConverter, a message conversion tool that mainly implements two functions:

  • Serialize the object returned by the server into a JSON string
  • Deserialize the JSON string sent from the front end into a Java object

The relevant dependencies are imported automatically by SpringBoot's starter packages, and Spring MVC auto-configures HttpMessageConverters for Jackson and Gson.

By default, our company's project uses Jackson to handle JSON. During serialization, Jackson lowercases the first letter of property names that begin with an uppercase letter.
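The "dESCR" in the payload is exactly what you get by mimicking that naming step by hand: strip the get prefix from a getter such as getDESCR, then lowercase the first remaining letter. The toy method below is my own illustration of that step, not Jackson's actual implementation:

```java
// Toy illustration of the getter-to-JSON-name step: strip "get",
// lowercase the first remaining letter. Applied to getDESCR it yields
// "dESCR", matching what showed up in the frontend payload.
public class BeanNaming {
    static String propertyName(String getter) {
        String raw = getter.substring("get".length());
        return Character.toLowerCase(raw.charAt(0)) + raw.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(propertyName("getDESCR"));   // dESCR
        System.out.println(propertyName("getUserId"));  // userId
        System.out.println(propertyName("getJobType")); // jobType
    }
}
```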

Solution

  • Add a @JsonProperty("DESCR") annotation to properties whose case must be preserved
  • Annotate the class with @JsonAutoDetect(getterVisibility = JsonAutoDetect.Visibility.NONE) so that property names are no longer derived from the getters

BatchUpdateException: ORA-12899

Oracle SQL table design problem

Bug Description

The product manager asked me to look into a bug from an unknown environment (I didn't ask whether it was development, test, or production). Uploading an Excel file from the frontend to update data failed with the prompt "Database exception, please contact the administrator".

Troubleshooting

This kind of message is usually a backend error, so I reproduced it by following the same steps and looked at the console output in IDEA, which showed:

Exception in thread "main" java.sql.BatchUpdateException: ORA-12899: value too large for column "xx" (actual: 208, maximum: 200)

Taken literally, the error says some data exceeds the byte limit of a certain column. Using the "Frontend Code Mapping Backend Code" method from our Project Record - Enterprise Requirements blog, I found the corresponding backend code and then the corresponding entity class, which told me which data table it writes to. This let us locate the problem table directly in the company's huge database.

After locating the problem table, clicking "Design Table" in Navicat shows the byte limit of every field. Matching those limits against the content of the failing Excel file made it easy to find the offending column with its 200-byte limit. The strange thing was this: if a Chinese character occupies 2 bytes, the data in that column of the Excel file should not reach 208 bytes.

Then query:

select userenv('language') from dual;

The result is AMERICAN_AMERICA.AL32UTF8

This shows that our Oracle database uses AMERICAN_AMERICA.AL32UTF8 as its character set. In this UTF-8 encoding, a Chinese character occupies three bytes. To verify the guess, query:

select lengthb('xx') from dual;

Here 'xx' is the over-long data content. The result is 208 bytes, which confirms the guess.
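The same byte counting can be reproduced outside the database, since AL32UTF8 byte lengths match UTF-8 (the strings below are illustrative, not the actual offending data):

```java
import java.nio.charset.StandardCharsets;

// lengthb() in Oracle counts bytes in the database character set; for
// AL32UTF8 that matches UTF-8 byte length, where a typical Chinese
// character occupies three bytes rather than two.
public class ByteLength {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("abc"));  // 3 - one byte per ASCII char
        System.out.println(utf8Length("中文")); // 6 - three bytes per character
    }
}
```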

Solution

  • Change the database character set to SIMPLIFIED CHINESE_CHINA.ZHS16GBK, in which a Chinese character occupies only two bytes. But this obviously affects the whole database, not just this table
  • Require users not to put too much content into the corresponding Excel column
  • Expand the byte limit of the field